
Staff Site Reliability Engineer in United States at Jobgether

Job Function: Engineering
Jobgether
United States, United States

Job Description

Staff Site Reliability Engineer

This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Staff Site Reliability Engineer based in the United States.

This role sits at the intersection of large-scale infrastructure operations, software engineering, and AI-driven systems reliability. You will help ensure the stability, performance, and scalability of complex SaaS platforms used by enterprise customers operating in highly critical domains. A key focus of the role is building and evolving intelligent, AI-assisted reliability tooling that reduces operational toil and accelerates incident resolution. You will own production systems end-to-end, from observability and incident response to long-term architectural improvements. The position blends hands-on engineering with technical leadership, requiring strong judgment in ambiguous, high-impact situations. You will also influence how modern SRE practices are defined and scaled across the organization. The environment is highly collaborative, fast-evolving, and deeply focused on engineering excellence and continuous improvement.

Accountabilities:
  • Lead the design and development of AI-assisted reliability and operations tooling that leverages logs, traces, tickets, and documentation to improve incident diagnosis and resolution speed.
  • Own end-to-end incident response, including detection, mitigation, root cause analysis, and implementation of long-term preventative fixes.
  • Improve observability systems across critical production services by enhancing metrics, logging, tracing, and alerting quality.
  • Define, implement, and evolve SLOs and SLIs to establish measurable reliability standards across key services.
  • Drive improvements in cloud operations for large-scale SaaS deployments, ensuring consistent, repeatable, and reliable customer environments.
  • Build internal tools and automation to reduce operational toil and increase engineering efficiency.
  • Collaborate with product and engineering teams to embed reliability and observability into system design from the outset.
  • Guide and mentor engineers on SRE practices, incident management, and operational excellence.
  • Contribute to the evolution of deployment, upgrade, and operational workflows for distributed systems in production.

Requirements:
  • Extensive experience in Site Reliability Engineering, platform engineering, or production-focused software engineering roles with strong operational ownership.
  • Deep hands-on experience with Kubernetes, Linux systems, and major cloud platforms (AWS, GCP, or Azure).
  • Strong software engineering skills in Python or Go, with a track record of building internal tools or production services.
  • Proven ability to operate, troubleshoot, and optimize complex distributed systems in production environments.
  • Strong expertise in observability practices, including metrics, logging, tracing, and incident response workflows.
  • Experience defining and working with SLOs/SLIs in large-scale systems.
  • Ability to lead technically ambiguous initiatives and influence cross-functional teams without formal authority.
  • Demonstrated success improving system reliability through automation and engineering, not just manual operations.
  • Strong communication skills with experience mentoring engineers or shaping technical practices.
  • Practical judgment in applying AI/LLM tools effectively within operational or engineering workflows.
  • Bonus: experience with SaaS platforms, LLM-based systems, or building tooling for support and developer productivity.

Benefits:
  • Competitive US base salary range: $200,000 – $230,000 annually
  • Equity participation in a high-growth technology company
  • Annual performance bonus or variable compensation (where applicable)
  • Comprehensive medical, dental, and vision insurance coverage
  • 401(k) retirement savings plan
  • Wellness and mental health support programs
  • Flexible remote work environment within the United States
  • Learning and development opportunities to support continuous growth
  • Paid time off and flexible vacation policies
  • Opportunity to work on cutting-edge AI-powered reliability systems at scale

How Jobgether works:
We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.
We appreciate your interest and wish you the best!
Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.
