AI Research Engineer (Model Compression & Quantization) in India at Jobgether
Job Description
This position is posted by Jobgether on behalf of a partner company. We are currently looking for an AI Research Engineer (Model Compression & Quantization) in India.
This role sits at the forefront of efficient AI systems research, focusing on making large-scale multimodal models practical for real-world deployment. You will work on advancing state-of-the-art techniques in model compression, enabling LLMs and vision-language models to run efficiently on resource-constrained devices such as mobile and edge hardware. The position combines deep research with hands-on engineering, requiring you to design and optimize pipelines that reduce memory usage, latency, and compute cost without sacrificing model performance. You will explore and implement techniques such as quantization, pruning, and knowledge distillation, contributing directly to scalable AI infrastructure. Operating in a highly research-driven and experimental environment, you will collaborate with AI engineers and researchers to push the boundaries of efficient multimodal intelligence. This is a high-impact role for someone passionate about both cutting-edge AI research and real-world deployment constraints.
Responsibilities
- Design and implement model compression techniques such as quantization, pruning, and knowledge distillation to optimize large multimodal AI models (LLMs and VLMs) for efficiency and scalability.
- Develop low-bit and mixed-precision quantization strategies to reduce model size and inference latency while preserving accuracy and output quality.
- Build and refine knowledge distillation pipelines to transfer capabilities from large teacher models to compact student models for efficient inference.
- Analyze performance trade-offs between accuracy, latency, memory usage, and throughput across different compression techniques and propose empirical improvements.
- Conduct research on emerging model compression methods and contribute to experimental validation of novel approaches for multimodal architectures.
- Document experiments, methodologies, and findings to ensure reproducibility and effective collaboration across research and engineering teams.
- Contribute to scientific publications and technical papers for leading AI conferences, advancing the field of efficient model deployment.
Requirements
- PhD or equivalent experience in Computer Science, Machine Learning, NLP, or a related field, with a strong research track record in AI or deep learning.
- Strong hands-on experience with PyTorch or equivalent deep learning frameworks for training and optimizing large-scale models.
- Proven expertise in model quantization, including both Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT).
- Practical experience with knowledge distillation techniques for compressing large neural networks into smaller, efficient models.
- Solid understanding of model pruning methods and neural network optimization strategies for efficiency improvement.
- Deep knowledge of transformer-based architectures (LLMs, VLMs), including training dynamics, backpropagation, fine-tuning, and optimization techniques.
- Strong research mindset with the ability to evaluate trade-offs and design experiments in multimodal AI systems.
- Familiarity with C++ for low-level optimization and inference acceleration is a plus.
Benefits
- Opportunity to work on cutting-edge AI research focused on efficient multimodal and generative model deployment.
- High-impact role contributing directly to scalable AI systems for real-world edge and mobile applications.
- Fully remote, global-first working environment with international collaboration.
- Strong focus on research freedom, experimentation, and publication in top-tier AI conferences.
- Exposure to advanced AI systems including LLMs, VLMs, and multimodal architectures at scale.
- Competitive compensation aligned with experience and technical expertise.
- Opportunity to shape next-generation AI efficiency standards and deployment techniques.