Site Reliability Engineer at Jobgether – United States
About This Position
This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Site Reliability Engineer in the United States.
In this role, you will play a critical part in ensuring the reliability, scalability, and performance of modern, user-facing systems. You’ll work at the intersection of software engineering and operations, building robust infrastructure and driving automation to support high-quality service delivery. The position offers the opportunity to design resilient systems, improve operational efficiency, and proactively address risks before they impact users. You will collaborate closely with cross-functional teams to enhance system design and implement best practices in observability and incident response. This environment values continuous improvement, innovation, and data-driven decision-making. It’s an ideal role for someone who thrives in fast-paced environments and is passionate about building reliable, scalable platforms.
Responsibilities
- Ensure high availability, reliability, and scalability of production systems and services
- Develop and maintain automation tools for deployments, configuration management, and operational workflows
- Implement and manage monitoring and alerting systems to provide real-time visibility into system health
- Respond to, troubleshoot, and resolve incidents while conducting post-mortems to prevent recurrence
- Define and monitor Service Level Objectives (SLOs) and performance indicators
- Perform capacity planning and resource forecasting to support system growth
- Collaborate with engineering teams to identify operational risks and improve system architecture
- Analyze system and application metrics to drive performance optimization initiatives
Requirements
- Minimum of 5 years of experience in IT, software engineering, or technology operations roles
- At least 2 years of hands-on experience in Site Reliability Engineering, DevOps, or observability-focused roles
- Strong expertise in cloud platforms such as AWS or Azure
- Solid understanding of distributed systems, networking, storage, and operating systems
- Experience with infrastructure as code tools (e.g., Terraform) and containerization technologies (e.g., Docker)
- Proficiency with monitoring and observability tools such as Datadog, Prometheus, Grafana, or similar
- Programming or scripting skills in languages such as Python, Ruby, or JavaScript
- Strong problem-solving skills and the ability to work collaboratively across teams
- Excellent communication skills with a proactive and detail-oriented mindset
Benefits
- Competitive salary with performance-based bonus opportunities
- Comprehensive medical, dental, and vision insurance
- Generous paid time off and company holidays
- 401(k) plan with employer matching contributions
- Paid parental leave and family support programs
- Flexible and collaborative work environment
- Opportunities for professional growth and skill development