
Senior Site Reliability Engineer in the United States at Jobgether

Job Function: Information Technology
Jobgether
United States

Job Description

Senior Site Reliability Engineer

This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Senior Site Reliability Engineer based in the United States.

This role sits at the core of a high-scale cloud infrastructure environment powering a leading AI-driven video platform used by global enterprise customers. You will take ownership of operational excellence across critical systems running on AWS, Kubernetes, and supporting services such as MongoDB and workflow orchestration tools. The position blends deep production reliability work with meaningful engineering ownership, focusing on eliminating operational fragility and reducing reliance on individual knowledge. You will be responsible for transforming manual, high-risk processes into automated, resilient systems that scale with the business. Working closely with engineering, infrastructure, and external vendors, you will help define how reliability is achieved at scale. This is a high-impact role for someone who thrives in ownership-heavy environments and enjoys solving complex operational challenges. The environment is fast-moving, highly technical, and deeply collaborative.

Accountabilities:

You will be responsible for ensuring the reliability, scalability, and operational excellence of core cloud infrastructure systems. This includes owning incident response processes, improving monitoring and detection, and driving long-term reductions in system failures and customer-impacting events.

  • Lead incident management activities, including on-call coordination, postmortems, and continuous improvement of response workflows
  • Design and implement automation to eliminate high-risk, low-frequency operational tasks and reduce system fragility
  • Take ownership of key infrastructure domains such as Kubernetes operations, observability systems, or workflow orchestration platforms
  • Manage vendor relationships and external integrations, ensuring reliability, accountability, and reduced operational dependency
  • Drive FinOps initiatives by improving cost visibility, optimizing cloud usage, and aligning infrastructure spend with business needs
  • Collaborate with engineering teams to define reliability standards, operational best practices, and scalable system design patterns
  • Build documentation and operational frameworks that eliminate single points of failure across critical systems

Requirements:


The ideal candidate brings strong hands-on experience in production infrastructure environments, with a focus on reliability engineering, automation, and cloud-native systems. You are comfortable operating in high-scale AWS and Kubernetes environments and have a pragmatic approach to solving operational challenges.

  • 5+ years of experience in Site Reliability Engineering, DevOps, or infrastructure-focused engineering roles in production environments
  • Strong experience with AWS and Kubernetes in large-scale systems, with additional familiarity with MongoDB and distributed systems
  • Proficiency in Python or similar scripting languages for automation and operational tooling
  • Deep understanding of incident management, root cause analysis, and production reliability practices
  • Strong judgment under pressure, with the ability to remain calm and effective during critical incidents
  • Experience working cross-functionally across engineering, infrastructure, and external vendor teams
  • Strong communication skills with the ability to influence through data, clarity, and collaboration rather than escalation
  • Bonus: exposure to FinOps, observability platforms, Temporal, or vendor management in infrastructure environments

Benefits:

  • Competitive base salary with performance-based compensation components
  • Equity participation in a high-growth technology company
  • Comprehensive medical, dental, and vision coverage for employees and eligible dependents
  • Flexible and remote-first working environment
  • Paid time off, parental leave, and company holidays
  • Learning and development budget to support continuous skill growth
  • Modern cloud infrastructure environment with opportunities to work on large-scale distributed systems
  • Exposure to cutting-edge AI infrastructure and enterprise-grade production systems

How Jobgether works:

We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.
We appreciate your interest and wish you the best!
Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.
