Can I apply directly for this job on this page?

Yes, you can begin your application on this page using a quick form. You'll then be redirected to the employer's career site to complete the full application process.

What is the role of a Staff Site Reliability Engineer at Jobgether?

The Staff Site Reliability Engineer position at Jobgether is a full-time or part-time opportunity in the Engineering field.

Where is this Staff Site Reliability Engineer job located?

United States

What type of employment is offered for this Staff Site Reliability Engineer role?

Full-time or part-time position

What is the expected salary for this Staff Site Reliability Engineer job?

Compensation will be discussed during the hiring process.

Staff Site Reliability Engineer

This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Staff Site Reliability Engineer based in the United States.

This role sits at the intersection of large-scale infrastructure operations, software engineering, and AI-driven systems reliability. You will help ensure the stability, performance, and scalability of complex SaaS platforms used by enterprise customers operating in highly critical domains. A key focus of the role is building and evolving intelligent, AI-assisted reliability tooling that reduces operational toil and accelerates incident resolution. You will own production systems end-to-end, from observability and incident response to long-term architectural improvements. The position blends hands-on engineering with technical leadership, requiring strong judgment in ambiguous, high-impact situations. You will also influence how modern SRE practices are defined and scaled across the organization. The environment is highly collaborative, fast-evolving, and deeply focused on engineering excellence and continuous improvement.

Accountabilities:

Lead the design and development of AI-assisted reliability and operations tooling that leverages logs, traces, tickets, and documentation to improve incident diagnosis and resolution speed.
Own end-to-end incident response, including detection, mitigation, root cause analysis, and implementation of long-term preventative fixes.
Improve observability systems across critical production services by enhancing metrics, logging, tracing, and alerting quality.
Define, implement, and evolve SLOs and SLIs to establish measurable reliability standards across key services.
Drive improvements in cloud operations for large-scale SaaS deployments, ensuring consistent, repeatable, and reliable customer environments.
Build internal tools and automation to reduce operational toil and increase engineering efficiency.
Collaborate with product and engineering teams to embed reliability and observability into system design from the outset.
Guide and mentor engineers on SRE practices, incident management, and operational excellence.
Contribute to the evolution of deployment, upgrade, and operational workflows for distributed systems in production.

Requirements:

Extensive experience in Site Reliability Engineering, platform engineering, or production-focused software engineering roles with strong operational ownership.
Deep hands-on experience with Kubernetes, Linux systems, and major cloud platforms (AWS, GCP, or Azure).
Strong software engineering skills in Python or Go, with a track record of building internal tools or production services.
Proven ability to operate, troubleshoot, and optimize complex distributed systems in production environments.
Strong expertise in observability practices, including metrics, logging, tracing, and incident response workflows.
Experience defining and working with SLOs/SLIs in large-scale systems.
Ability to lead technically ambiguous initiatives and influence cross-functional teams without formal authority.
Demonstrated success improving system reliability through automation and engineering, not just manual operations.
Strong communication skills with experience mentoring engineers or shaping technical practices.
Practical judgment in applying AI/LLM tools effectively within operational or engineering workflows.
Bonus: experience with SaaS platforms, LLM-based systems, or building tooling for support and developer productivity.

Benefits:

Competitive US base salary range: $200,000 – $230,000 annually
Equity participation in a high-growth technology company
Annual performance bonus or variable compensation (where applicable)
Comprehensive medical, dental, and vision insurance coverage
401(k) retirement savings plan
Wellness and mental health support programs
Flexible remote work environment within the United States
Learning and development opportunities to support continuous growth
Paid time off and flexible vacation policies
Opportunity to work on cutting-edge AI-powered reliability systems at scale

How Jobgether works:

We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.

We appreciate your interest and wish you the best!

Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.
