Sr Site Reliability Engineer in United States at Jobgether
Job Description
This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Sr Site Reliability Engineer based in the United States.
This role sits at the core of a large-scale, cloud-based SaaS platform that supports millions of users in the education sector, where reliability and performance directly impact learning outcomes worldwide. You will be responsible for ensuring the availability, scalability, and security of complex distributed systems operating in production environments. The position blends hands-on engineering with strategic influence over SRE practices, observability, and infrastructure modernization. You will work closely with engineering, security, and product teams to reduce downtime, improve system resilience, and optimize deployment pipelines. The environment is fast-paced and highly collaborative, requiring strong problem-solving skills and a proactive approach to incident prevention and response. This is an opportunity to shape SRE maturity while contributing to mission-driven technology at global scale.
Responsibilities
- Drive reliability, availability, and observability improvements across large-scale distributed systems supporting cloud-based applications.
- Design, implement, and maintain infrastructure-as-code solutions using tools such as Terraform and CloudFormation.
- Support and enhance CI/CD pipelines to ensure efficient, secure, and reliable software delivery.
- Monitor production systems, investigate incidents, and lead resolution efforts for critical outages and performance issues.
- Collaborate with engineering, security, and operations teams to identify root causes and implement long-term reliability improvements.
- Conduct disaster recovery planning and exercises to validate system resilience and business continuity readiness.
- Contribute to on-call rotations and provide support for off-hours incidents, deployments, and escalations.
- Explore and integrate AI-driven tools to improve SRE workflows, monitoring, alerting, and incident response efficiency.
- Mentor peers and contribute to building a strong engineering culture through technical guidance and feedback.
Requirements
- 5+ years of experience in Site Reliability Engineering or related infrastructure/DevOps roles.
- Strong experience managing production cloud environments, preferably AWS with Kubernetes (EKS) at scale.
- Hands-on expertise with infrastructure-as-code, containerization, and configuration management tools such as Terraform, Docker, and Ansible.
- Proficiency in at least one programming or scripting language (Python, Java, .NET, JavaScript, or similar).
- Experience building and maintaining CI/CD pipelines in modern engineering environments.
- Strong understanding of monitoring, logging, alerting, and observability best practices.
- Experience with on-call rotations and incident management in production environments.
- Solid understanding of agile development methodologies and cross-functional collaboration.
- Strong analytical and troubleshooting skills with a focus on reliability and system performance.
- Nice to have: experience with SLO/SLI/SLA frameworks, DR exercises, HPC environments, or tools such as Datadog, New Relic, Grafana, PagerDuty, or GitLab/GitHub pipelines.
- Nice to have: exposure to AI-assisted engineering or automation tools in production SRE workflows.
Benefits
- Competitive compensation package including base salary, typically ranging from $109,500 to $150,550 USD, plus potential performance-based incentives.
- Comprehensive medical, dental, and vision insurance coverage.
- 401(k) and Roth 401(k) retirement plans with company matching.
- Generous paid time off, including vacation, sick leave, and 12 paid holidays.
- Paid parental leave and family support programs.
- Health savings accounts (HSA) and flexible spending accounts (FSA).
- Tuition reimbursement and continuous learning opportunities.
- Remote-first flexibility and support for work-life balance.
- Wellness programs and employee assistance resources.