Can I apply directly for this job on this page?

Yes, you can begin your application on this page using a quick form. You'll then be redirected to the employer's career site to complete the full application process.

What is the role of a Staff Production Operations Engineer at Jobgether?

The Staff Production Operations Engineer position at Jobgether is a full-time or part-time opportunity in the Engineering field.

Where is this Staff Production Operations Engineer job located?

United States

What type of employment is offered for this Staff Production Operations Engineer role?

Full-time or part-time position

What is the expected salary for this Staff Production Operations Engineer job?

Compensation will be discussed during the hiring process.

Staff Production Operations Engineer

This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Staff Production Operations Engineer based in the United States.

This role sits at the intersection of reliability engineering, automation, and operational excellence, supporting large-scale distributed systems that process high volumes of real-time data.
You will be responsible for improving system reliability, streamlining incident management workflows, and building automation that reduces operational overhead across engineering teams.
The position plays a key role in shaping how incidents are detected, managed, and learned from, ensuring faster resolution and continuous improvement across production environments.
You will collaborate closely with engineering, product, and customer-facing teams to maintain high availability and performance standards across global systems.
A strong emphasis is placed on leveraging AI-driven tooling to automate repetitive operational tasks and enhance incident response efficiency.
This is a high-impact role ideal for someone who thrives in fast-paced infrastructure environments and enjoys combining SRE discipline with automation and tooling innovation.

Accountabilities:

Drive end-to-end improvements across the incident lifecycle, including alerting quality, severity classification, triage processes, and post-incident follow-ups.
Coordinate on-call programs across distributed teams, including scheduling, onboarding, and ensuring consistent operational coverage.
Lead incident reviews, identify root causes, and ensure actionable follow-ups are tracked and completed effectively.
Build and deploy automation and AI-driven agents to reduce operational toil, including incident summarization and on-call optimization.
Maintain and evolve runbooks, playbooks, and operational documentation to reflect current system behavior and best practices.
Partner with engineering and product teams to improve system observability, reliability metrics, and operational readiness.
Contribute directly to incident resolution when needed by debugging, prototyping fixes, or implementing mitigation strategies.
Improve monitoring, alerting, and observability systems to reduce noise and increase signal quality across production environments.

Requirements:

5+ years of experience in Site Reliability Engineering, DevOps, or Production Operations in large-scale distributed environments.
Strong experience with incident management platforms such as PagerDuty, incident.io, or similar tools.
Hands-on expertise with observability stacks including Datadog, Grafana, CloudWatch, Sentry, or equivalents.
Strong understanding of reliability engineering principles such as SLOs, SLIs, MTTR, MTTA, and error budgets.
Experience building automation, tooling, or systems to reduce operational toil and improve engineering efficiency.
Proficiency in Go or another systems programming language with the ability to contribute to production codebases.
Familiarity with cloud environments (AWS, Azure, or GCP) and infrastructure-as-code practices.
Experience leveraging AI-assisted development tools to improve workflows and operational processes.
Strong written communication skills with the ability to coordinate across teams without direct authority.
Nice to have: experience building AI agents, working with streaming systems such as Kafka or Redpanda, or operating high-scale infrastructure environments.

Benefits:

Competitive compensation aligned with senior infrastructure engineering roles in the US market
Equity participation in a high-growth, innovation-driven technology company
Remote-first work environment across the United States
Comprehensive medical, dental, and vision insurance coverage
Flexible paid time off and paid holidays
Strong emphasis on engineering autonomy, tooling, and modern AI-driven workflows
Opportunity to work on large-scale distributed systems and real-time data infrastructure
Professional growth in a fast-moving, globally distributed engineering organization

How Jobgether works:

We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.

We appreciate your interest and wish you the best!

Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.
