Senior Site Reliability Engineer in United States at Jobgether
Job Description
This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior Site Reliability Engineer in the United States.
This role is focused on ensuring the reliability, scalability, and performance of a modern, cloud-native platform that supports privacy, security, and data-driven services at enterprise scale. You will act as a senior technical owner of production stability, working closely with engineering, security, and developer experience teams to embed strong reliability practices across the software lifecycle. The environment is fast-moving and highly collaborative, requiring a balance of hands-on engineering and strategic thinking. You will help define and evolve SRE standards, turning incidents and operational learnings into long-term systemic improvements. This is a high-impact position where your work directly influences platform resilience, customer experience, and engineering efficiency. It offers the opportunity to shape observability, incident response, and infrastructure strategy in a remote-first organization.
Responsibilities
- Lead reliability design and production readiness reviews for services, ensuring strong observability, safe deployments, and rollback strategies
- Build, operate, and improve observability systems including logging, metrics, tracing, dashboards, alerts, and runbooks for incident response
- Own incident management processes, including on-call participation, escalation handling, post-incident reviews, and long-term remediation tracking
- Design and execute disaster recovery testing, game days, and resilience exercises to validate system robustness and reduce failure points
- Perform capacity planning and cloud cost optimization to ensure scalable, efficient, and high-performing infrastructure
- Identify systemic reliability risks and drive cross-team initiatives to reduce incidents and improve platform stability
- Collaborate with engineering and security teams to integrate reliability practices into CI/CD pipelines, tooling, and development workflows
- Continuously improve on-call operations, automation, alerting quality, and operational documentation
Requirements
- 5+ years of experience in Site Reliability Engineering, Production Engineering, Infrastructure Engineering, or similar production-focused roles
- Strong hands-on experience with cloud infrastructure (ideally AWS), including compute, networking, storage, and security services
- Proficiency in at least one programming language such as Python, JavaScript, or TypeScript, with the ability to review and understand production code
- Experience with infrastructure as code and CI/CD tools such as Terraform, CloudFormation, or equivalent platforms
- Deep knowledge of observability tools (e.g., Datadog or similar), including alert design, monitoring strategies, and incident signal management
- Proven experience leading incident response, root cause analysis, and postmortem processes with actionable outcomes
- Strong communication and collaboration skills, with the ability to influence across engineering teams without formal authority
- Experience participating in or improving on-call rotations, escalation workflows, and operational readiness practices
- Bachelor’s degree in a technical field or equivalent practical experience
- Ability to thrive in a remote, high-autonomy environment with strong ownership and execution discipline
Benefits
- Competitive salary aligned with experience and location
- Equity participation as part of the total compensation package
- Flexible remote-first work environment
- Comprehensive health, dental, and vision insurance
- 401(k) retirement plan with company match
- Flexible PTO and paid parental leave
- Home office support and remote work stipend
- Strong learning culture with growth and development opportunities