Senior AIOps Engineer in India at Jobgether
Job Description
This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior AIOps Engineer I in India.
This role sits at the intersection of AI, machine learning, and platform reliability, focusing on ensuring that production AI systems operate efficiently, securely, and at scale. You will be responsible for maintaining and improving the operational health of AI/ML-powered services running in production environments. The position involves working closely with data scientists, ML engineers, and platform teams to ensure smooth deployment, monitoring, and lifecycle management of AI models. You will play a key role in building observability, automation, and infrastructure that supports reliable AI delivery. The environment is highly collaborative and fast-evolving, with a strong emphasis on scalability, cost optimization, and production readiness. This is a hands-on engineering role where your work directly impacts the stability and performance of AI-driven products used at scale.
Responsibilities
- Own the reliability, availability, and performance of AI/ML services in production environments.
- Define and maintain SLOs/SLIs for AI systems, ensuring alignment with user experience and business outcomes.
- Monitor, detect, and mitigate model drift, performance degradation, and system issues in production.
- Design and implement observability solutions including monitoring, logging, alerting, and dashboards for AI systems.
- Support deployment workflows for ML models, including canary, blue/green, and A/B testing strategies.
- Operate and improve AI infrastructure components such as model serving systems, LLM gateways, and RAG pipelines.
- Manage CI/CD pipelines and automation to improve deployment reliability and reduce operational overhead.
- Participate in incident management, on-call rotations, and post-incident reviews to improve system resilience.
- Collaborate with cross-functional teams to ensure scalable, secure, and cost-efficient AI operations.
Requirements
- 4+ years of software engineering experience, including at least 3 years in production systems, SRE, DevOps, or platform engineering roles.
- Strong experience operating distributed systems on Kubernetes and cloud platforms.
- Hands-on experience with Google Cloud Platform services such as GKE, BigQuery, Pub/Sub, Vertex AI, Cloud SQL, and GCS.
- Solid understanding of CI/CD pipelines, infrastructure-as-code (Terraform preferred), and deployment automation.
- Experience with monitoring, logging, and observability tools such as Datadog, Prometheus, Grafana, or the ELK stack.
- Familiarity with containerization and Docker image lifecycle management.
- Understanding of ML lifecycle concepts including training, deployment, evaluation, and monitoring.
- Exposure to AI/ML tooling such as LLM gateways, vector databases, RAG systems, or embedding pipelines is a strong plus.
- Strong Python programming skills and solid software engineering fundamentals.
- Excellent communication skills with the ability to work across technical and non-technical stakeholders.
- Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience).
Benefits
- Competitive annual salary aligned with experience and market standards.
- Fully remote work with structured overlap hours for global collaboration.
- Comprehensive health, accident, and retirement benefits.
- Paid holidays, generous leave policies, and wellness programs.
- Exposure to cutting-edge AI/ML infrastructure and large-scale production systems.
- Strong culture of learning, ownership, and cross-functional collaboration.
- Opportunity to work on high-impact AI systems used in real-world production environments.
- Inclusive and globally distributed team environment.