Site Reliability Engineer in India at Jobgether
Job Description
This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Site Reliability Engineer based in India.
In this role, you will help design, operate, and scale highly available distributed systems that power mission-critical cloud and data platforms. You will work at the intersection of infrastructure, automation, and reliability engineering, ensuring systems remain resilient, observable, and performant under real-world production demands. The environment is fast-paced, cloud-native, and deeply technical, with strong emphasis on Kubernetes-based architectures and modern DevOps practices. You will collaborate closely with engineering, data, and AI/ML teams to support complex workloads across global infrastructure. This role offers the opportunity to solve challenging scalability and performance problems at enterprise scale. It is ideal for engineers who enjoy building automation-first systems and improving reliability through engineering rigor and continuous improvement.
Responsibilities
- Operate and optimize containerized environments using Kubernetes and service mesh technologies such as Istio, ensuring high availability and performance across distributed systems.
- Build automation and operational tooling using Go, Python, and Shell scripting to reduce manual intervention and improve system efficiency.
- Design and maintain observability stacks using Prometheus, Grafana, and Loki for proactive incident detection and resolution.
- Troubleshoot and resolve complex issues across networking, storage, and system performance layers in large-scale distributed environments.
- Participate in on-call rotations, incident response, and postmortem analysis to continuously improve reliability and operational maturity.
- Collaborate with AI/ML and data engineering teams to ensure infrastructure readiness for model training, inference workloads, and data pipelines.
Requirements
- Strong hands-on experience with cloud platforms, particularly Google Cloud, and infrastructure-as-code tools such as Terraform.
- Solid understanding of microservices architectures, containerization, and distributed systems, including production use of Kubernetes and Docker.
- Strong SRE mindset focused on automation, scalability, observability, and reliability engineering principles.
- Practical experience in Linux system administration, networking fundamentals, and security concepts such as PKI and secure service-to-service communication.
- Strong problem-solving skills, ability to work in high-pressure environments, and comfort with incident management and operational ownership.
Benefits
- Competitive total rewards package aligned with industry standards.
- Fully remote work flexibility with no mandatory office presence.
- Generous training and certification support to accelerate technical growth.
- Dedicated equipment and home-office setup support, including OS choice for your workstation.
- Annual wellness budget supporting fitness, health, and personal well-being.
- Paid vacation, sick leave, and dedicated volunteer time off.
- Exposure to cutting-edge cloud, data, and AI infrastructure environments.