Cloud Reliability & Recovery Engineer in India at Jobgether
Job Description
This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Cloud Reliability & Recovery Engineer based in India.
This role sits at the core of large-scale cloud resilience engineering, focused on ensuring critical systems remain highly available, fault-tolerant, and recoverable under any disruption. You will design and operate advanced AWS-based disaster recovery and business continuity architectures across multi-region environments. The position requires deep hands-on engineering expertise in cloud infrastructure, automation, and reliability practices, with a strong emphasis on Kubernetes, Infrastructure as Code, and CI/CD-driven operations. You will work closely with security, infrastructure, and application teams to define and enforce recovery strategies aligned with strict RTO/RPO objectives. This is a highly technical role where you will build automated DR systems, validate resiliency through chaos engineering, and continuously improve platform stability. The environment is fast-paced, engineering-driven, and focused on measurable reliability outcomes at enterprise scale.
Design, implement, and maintain highly resilient cloud architectures with a strong focus on disaster recovery, business continuity, and system availability. Responsibilities include:
- Designing multi-region and multi-AZ AWS architectures aligned with defined RTO/RPO targets
- Building and maintaining failover and failback mechanisms using Route 53, Global Accelerator, and CloudFront
- Developing automated disaster recovery runbooks using AWS Systems Manager, Step Functions, and related services
- Implementing backup and recovery strategies across AWS services including EC2, RDS, S3, DynamoDB, and Aurora
- Automating backup policies, replication workflows, and recovery validation processes
- Performing chaos engineering and resilience testing using AWS Fault Injection Service (formerly AWS Fault Injection Simulator)
- Managing Infrastructure as Code using Terraform and/or CloudFormation for DR environments
- Developing CI/CD-driven automation for failover, deployment, and recovery workflows
- Building observability dashboards, alerts, and incident response workflows using CloudWatch and third-party tools
- Participating in on-call rotations, incident response, and post-incident reviews
- Maintaining DR documentation, compliance artifacts, and audit-ready recovery evidence
The ideal candidate brings strong AWS expertise, deep cloud reliability experience, and a proven ability to design and operate large-scale disaster recovery systems.
- 5+ years of experience in cloud infrastructure, SRE, or disaster recovery engineering roles
- 3+ years of hands-on AWS production experience at scale
- Proven experience designing and implementing multi-region DR architectures with defined RTO/RPO
- Strong expertise in AWS services including EC2, RDS, S3, DynamoDB, Aurora, and related resilience tools
- Hands-on experience with Kubernetes-based deployments and cloud-native architecture
- Strong scripting skills in Python, Bash, or PowerShell for automation and orchestration
- Experience with Infrastructure as Code tools such as Terraform or AWS CloudFormation
- Solid understanding of networking concepts including VPC, DNS failover, VPN, and Direct Connect
- Strong knowledge of CI/CD pipelines and automation frameworks
- Excellent communication skills with the ability to produce clear technical and executive reports
- Experience with resilience frameworks, compliance standards, and operational best practices
Benefits include:
- Competitive compensation aligned with experience and industry standards
- Opportunity to work on mission-critical, large-scale cloud resilience systems
- Remote-friendly work environment with global collaboration
- Exposure to advanced AWS architectures, DR automation, and chaos engineering practices
- Strong focus on engineering excellence, automation, and continuous improvement
- Learning opportunities in cloud reliability, security, and enterprise-scale infrastructure
- Collaborative environment working with highly skilled engineering and security teams