Cloud Reliability & Recovery Engineer in India at Jobgether
Job Description
This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Cloud Reliability & Recovery Engineer based in India.
This is a senior, hands-on cloud engineering role focused on building and maintaining highly resilient, always-available AWS environments. You will design and operate large-scale disaster recovery (DR) and business continuity planning (BCP) frameworks that keep critical systems operational even during major disruptions. The role sits at the intersection of SRE, infrastructure engineering, and incident response, with a strong emphasis on automation, fault tolerance, and cloud-native architecture. You will work extensively with Kubernetes, Terraform, and AWS-native resilience services to engineer multi-region failover and recovery strategies. The environment is fast-paced, security-conscious, and highly collaborative, involving close partnership with infrastructure, security, and application teams. Your work will directly reduce downtime risk and strengthen global service reliability across mission-critical systems.
Responsibilities
- Design and implement highly available, multi-region and multi-AZ AWS architectures aligned with defined RTO/RPO objectives, ensuring system continuity under failure scenarios.
- Build and maintain disaster recovery (DR) solutions including automated failover/failback mechanisms using services such as Route 53, Global Accelerator, CloudFront, and AWS Systems Manager.
- Develop and execute backup, restore, and data replication strategies across AWS services (RDS, DynamoDB, S3, EFS, Aurora), ensuring integrity and recoverability.
- Implement infrastructure as code using Terraform or CloudFormation to standardize and automate DR-ready environments.
- Create and maintain CI/CD-driven DR testing pipelines, including chaos engineering practices to validate system resilience under real-world failure conditions.
- Monitor system availability and resilience using CloudWatch, incident tooling, and AWS health services, participating in on-call rotations and leading incident response efforts.
- Conduct DR drills, tabletop exercises, and post-incident reviews to continuously improve recovery readiness and compliance posture.
Requirements
- 5+ years of experience in cloud engineering, SRE, infrastructure, or disaster recovery roles, including at least 3 years in AWS production environments at scale.
- Proven experience designing and operating multi-region disaster recovery architectures with measurable RTO/RPO outcomes.
- Strong expertise in AWS services related to resilience, including networking (VPC, DNS, VPN, Direct Connect) and storage/database replication.
- Hands-on experience with Infrastructure as Code tools such as Terraform and/or CloudFormation.
- Proficiency in scripting and automation using Python, Bash, or PowerShell.
- Solid understanding of Kubernetes-based deployments, including scaling, self-healing, and multi-cluster strategies.
- Experience with CI/CD tools and practices (e.g., GitHub Actions, CodePipeline, CodeBuild).
- Strong communication skills with the ability to document DR strategies and present technical risks and recovery plans clearly.
- Preferred: AWS certifications (Solutions Architect – Professional, DevOps Engineer – Professional, Advanced Networking – Specialty).
Benefits
- Competitive compensation package aligned with senior-level cloud engineering roles.
- Opportunity to work on large-scale, mission-critical cloud infrastructure with global impact.
- Flexible and remote-friendly work arrangements (depending on team policy).
- Strong focus on learning and upskilling in advanced AWS, resilience engineering, and cloud architecture.
- Exposure to modern engineering practices including chaos engineering, SRE methodologies, and GitOps workflows.
- Collaborative, high-autonomy environment with strong engineering ownership.
- Health, wellness, and standard employee benefits in line with industry benchmarks.