Site Reliability Engineer - SRE in India at Jobgether
Job Description
This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Site Reliability Engineer (SRE) in India.
This role offers the opportunity to shape and scale the reliability backbone of a fast-growing, product-led SaaS platform. You will ensure high availability, performance, and security across complex cloud-native systems that serve financial planning and decision-making use cases. Working in a remote-first, highly collaborative culture, you will partner closely with engineering teams to embed reliability into every stage of the software lifecycle. The position requires ownership of multi-cloud infrastructure and a strong focus on automation and observability. You will help build resilient systems that scale with rapid business growth while meeting strict security and compliance standards. This is a hands-on role in which engineering excellence directly affects customer trust and product success.
Responsibilities
- Design, manage, and optimize scalable multi-cloud infrastructure across AWS and GCP, ensuring high availability, cost efficiency, and security compliance.
- Lead Kubernetes orchestration, including cluster design, deployment strategies, and configuration management for consistent and reliable environments.
- Implement and maintain service mesh solutions to secure and monitor service-to-service communication across distributed systems.
- Build and optimize CI/CD pipelines using Git and Jenkins, improving deployment speed, reliability, and automated testing coverage.
- Develop Infrastructure as Code (Terraform) to provision and manage cloud resources in a repeatable and version-controlled manner.
- Drive automation initiatives using Python to reduce operational toil, streamline maintenance, and improve system resilience.
- Own observability systems using tools like Prometheus, Grafana, ELK/EFK, CloudWatch, and GCP Operations Suite to ensure full system visibility.
- Lead incident response, postmortems, and reliability engineering practices, defining and tracking SLIs, SLOs, and SLAs.
- Collaborate with development teams to embed DevOps and reliability best practices into application design and delivery.
Requirements
- 5+ years of experience in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles in a SaaS or high-scale environment.
- Strong expertise in AWS (EC2, EKS, RDS, VPC, IAM, S3) and GCP (GKE, Compute Engine, Cloud SQL, IAM, Cloud Storage).
- Advanced knowledge of Kubernetes and Docker, including deployment, scaling, and lifecycle management.
- Solid experience with Terraform and Infrastructure as Code principles.
- Strong programming skills in Python for automation and tooling development.
- Hands-on experience with observability stacks including Prometheus, Grafana, and ELK/EFK.
- Deep understanding of cloud networking, distributed systems, and security best practices (zero-trust, IAM, RBAC).
- Experience building CI/CD pipelines and working with Git-based workflows.
- Strong problem-solving skills, ownership mindset, and ability to thrive in fast-paced, remote-first environments.
Benefits
- Fully remote-first work environment with high autonomy and flexibility.
- Opportunity to work on modern, cloud-native, high-scale infrastructure.
- Competitive compensation aligned with experience and market standards.
- Culture of openness, transparency, and idea-driven innovation.
- Strong focus on learning, experimentation, and professional growth.
- Collaborative, engineering-led environment with high ownership and impact.
- Exposure to cutting-edge technologies in multi-cloud, Kubernetes, and observability stacks.