Senior Site Reliability Engineer in United States at Jobgether

Jobgether
United States, United States

Job Description

Senior Site Reliability Engineer

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior Site Reliability Engineer in the United States.

This role sits at the core of large-scale production reliability, where engineering excellence meets modern AI-driven operations. You will help ensure critical services remain highly available, observable, and performant across complex cloud environments. Acting as a senior technical owner, you will partner closely with product and platform teams to define reliability standards, improve system resilience, and reduce operational toil. The position emphasizes deep production troubleshooting, infrastructure automation, and scalable SRE practices powered by emerging AI tooling. You will also play a key role in shaping how AI agents are safely integrated into engineering workflows. This is a high-impact role where your work directly improves system stability, developer velocity, and customer experience at scale.

Accountabilities

You will be responsible for driving reliability, scalability, and operational excellence across distributed systems in a cloud-native environment.

  • Define and manage service-level objectives (SLOs), error budgets, and reliability metrics in partnership with engineering teams, ensuring alignment with product priorities and system health expectations
  • Investigate and resolve complex production incidents across application, infrastructure, data, and network layers using logs, metrics, traces, and modern AI-assisted debugging techniques
  • Design and implement automation, tooling, and AI-enabled workflows to eliminate operational toil and improve system efficiency
  • Build and maintain infrastructure-as-code, CI/CD pipelines, and production-ready automation in support of scalable service delivery
  • Develop clear technical documentation, including runbooks, postmortems, architecture notes, and operational guidelines for both technical and non-technical stakeholders
  • Collaborate cross-functionally with engineering, product, and operations teams to improve system reliability and influence technical roadmaps
  • Support the safe and effective use of AI agents in production environments by defining guardrails, context, and validation frameworks

Requirements

The ideal candidate brings strong production engineering expertise combined with modern cloud, automation, and AI experience.

  • 5+ years of experience in Site Reliability Engineering, software engineering, or production operations roles
  • Strong hands-on expertise in troubleshooting distributed systems using observability tools (logs, metrics, tracing, profiling)
  • Experience operating production workloads on AWS services such as EC2, S3, EKS, RDS/Aurora, and CloudFront
  • Proficiency in infrastructure-as-code and automation using languages such as Python, Go, or Java
  • Experience with observability platforms such as Grafana, Prometheus, or similar tools
  • Strong understanding of CI/CD pipelines and modern software delivery practices
  • Experience using AI tools and agentic development environments (e.g., Copilot, Cursor, Claude Code) in production workflows
  • Ability to design, document, and communicate complex technical systems clearly to diverse audiences
  • Strong analytical thinking, problem-solving skills, and the ability to operate in Agile environments
  • Must be a U.S. citizen physically located in the United States due to contractual requirements

Benefits

  • Competitive base salary range: $113,300 – $205,520 USD (based on location, experience, and qualifications)
  • Performance-based bonus and/or equity opportunities depending on role structure
  • Comprehensive health, dental, and vision insurance
  • 401(k) retirement plan with company contributions
  • Flexible paid time off and paid holidays
  • Remote-first work environment within the United States
  • Life, disability, and supplemental insurance coverage
  • Strong focus on learning, innovation, and adoption of cutting-edge AI technologies

How Jobgether works:

We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.

We appreciate your interest and wish you the best!

Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.
