Site Reliability Engineer - SRE in India at Jobgether
Job Description
This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Site Reliability Engineer (SRE) in India.
This is a key engineering role focused on the reliability, scalability, and security of a fast-growing SaaS platform operating in a multi-cloud environment. You will design and maintain the infrastructure behind mission-critical financial planning applications used by businesses worldwide. The position involves close collaboration with software engineering teams to embed reliability into every stage of development, from code to production. You will work on automation, observability, and incident response while continuously improving system performance and resilience. Operating in a high-growth, product-driven environment, you will help shape infrastructure strategy and drive operational excellence across complex distributed systems. This is an opportunity to work at the intersection of cloud architecture, DevOps, and reliability engineering at scale.
Responsibilities
- Architect, manage, and optimize highly available cloud infrastructure across AWS and GCP environments.
- Design, deploy, and maintain scalable Kubernetes clusters using best practices and configuration tools such as Kustomize.
- Implement and manage service mesh technologies (e.g., Istio, Linkerd) to enhance security, observability, and service communication.
- Build and maintain CI/CD pipelines using Git and Jenkins to improve deployment efficiency and release velocity.
- Develop and maintain Infrastructure as Code (IaC) using Terraform for consistent and secure provisioning of cloud resources.
- Automate operational tasks and reduce system toil through Python scripting and internal tooling.
- Design and improve observability systems using tools such as Prometheus, Grafana, ELK/EFK, CloudWatch, and GCP Operations Suite.
- Lead incident response, postmortems, and reliability initiatives, including defining and monitoring SLIs, SLOs, and SLAs.
- Collaborate with development teams to ensure systems are designed for scalability, resilience, and production readiness.
- Identify performance bottlenecks and drive continuous improvements in infrastructure and operational processes.
Requirements
- 5+ years of experience in Site Reliability Engineering, DevOps, or cloud infrastructure roles in SaaS environments.
- Strong hands-on experience with AWS (EC2, EKS, RDS, VPC, IAM, S3) and GCP (GKE, Compute Engine, Cloud SQL, IAM).
- Expert-level knowledge of Kubernetes and Docker, including cluster management and deployment strategies.
- Strong programming skills in Python and extensive experience with Terraform.
- Experience building observability systems using Prometheus, Grafana, and ELK/EFK stacks.
- Solid understanding of cloud networking concepts such as VPC, load balancing, DNS, and routing.
- Strong grasp of security principles including zero-trust architecture in cloud-native systems.
- Experience with CI/CD systems such as Jenkins and Git-based workflows.
- Strong problem-solving skills with the ability to operate in fast-paced, high-scale environments.
- Excellent collaboration and communication skills with a strong DevOps mindset.
Benefits
- Fully remote-friendly work environment with high autonomy.
- Opportunity to work on multi-cloud, high-scale SaaS infrastructure.
- Strong engineering culture focused on ownership, transparency, and innovation.
- Exposure to modern DevOps, SRE, and cloud-native technologies.
- Collaborative environment with strong emphasis on learning and experimentation.
- Opportunity to directly impact system reliability for global enterprise users.