Sr. Site Reliability Engineer at Jobgether – United States
About This Position
This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Sr. Site Reliability Engineer in the United States.
This role provides a unique opportunity to ensure the stability, scalability, and reliability of critical systems in a fast-paced, cloud-focused environment. The Sr. Site Reliability Engineer will work across engineering, product, and operations teams to embed reliability practices into daily workflows, automate processes, and proactively prevent system issues. This position requires a balance of hands-on technical expertise and strategic thinking to drive infrastructure improvements, optimize operational efficiency, and maintain high service availability. The role offers exposure to modern cloud platforms, containerized environments, and large-scale distributed systems, while giving you a chance to influence reliability standards and incident response practices. Ideal candidates are problem-solvers who enjoy mentoring others, designing resilient systems, and improving operational processes. This position allows you to make a measurable impact on system performance, customer experience, and engineering culture.
Responsibilities:
- Own and enhance the availability, durability, and performance of production services across all environments
- Lead complex reliability projects from problem identification to resolution, ensuring high-quality technical ownership
- Define and enforce service health standards, including SLIs, SLOs, and error budgets
- Lead critical incident response and post-incident reviews, translating insights into long-term architectural improvements
- Design and implement scalable automation, monitoring, logging, and alerting solutions to reduce manual effort
- Build and maintain infrastructure-as-code, CI/CD pipelines, and operational tools to improve efficiency
- Collaborate with engineering, product, and operations teams to embed reliability practices and guide resilient system design
- Develop operational playbooks, runbooks, and documentation to support continuous improvement and knowledge sharing
Requirements:
- Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent experience
- 8+ years of progressive experience in site reliability, systems engineering, or operations
- Expert-level Linux administration, advanced troubleshooting, and system security skills
- Deep understanding of distributed systems, containers and container orchestration (Docker, Kubernetes), and microservices architecture
- Proficiency in scripting/programming languages such as Python, Go, or Bash
- Experience with monitoring, logging, and alerting frameworks (Prometheus, Grafana, ELK, Catchpoint)
- Strong familiarity with cloud platforms (AWS, GCP, or Azure) and HashiCorp tools (Terraform, Vault, Nomad)
- Excellent problem-solving, collaboration, and communication skills, with a proactive approach to continuous improvement
- Preferred: ITIL/OSS experience, SaaS or hyper-scale distributed system experience, and a history of mentoring teams on reliability best practices
Benefits:
- Competitive salary in the range of $150,000–$200,000 USD, based on experience and location
- Comprehensive healthcare coverage, including dental and vision for family members
- 401(k) plan with company matching and potential RSU grants
- Flexible vacation policy and parental leave
- Work-from-home support including equipment stipend
- Learning and development programs to grow technical expertise and career trajectory
- Culture that promotes work-life balance and collaborative problem-solving