Senior Site Reliability Engineer in United States at Jobgether
Job Description
This position is listed on behalf of a partner company that manages all applications and next steps. Our partner is looking for a Senior Site Reliability Engineer based in the United States.
This role sits at the core of a high-scale cloud infrastructure environment powering a leading AI-driven video platform used by global enterprise customers. You will take ownership of operational excellence across critical systems running on AWS, Kubernetes, and supporting services such as MongoDB and workflow orchestration tools. The position blends deep production reliability work with meaningful engineering ownership, focusing on eliminating operational fragility and reducing reliance on individual knowledge. You will be responsible for transforming manual, high-risk processes into automated, resilient systems that scale with the business. Working closely with engineering, infrastructure, and external vendors, you will help define how reliability is achieved at scale. This is a high-impact role for someone who thrives in ownership-heavy environments and enjoys solving complex operational challenges. The environment is fast-moving, highly technical, and deeply collaborative.
You will be responsible for ensuring the reliability, scalability, and operational excellence of core cloud infrastructure systems. This includes owning incident response processes, improving monitoring and detection, and driving long-term reductions in system failures and customer-impacting events.
- Lead incident management activities, including on-call coordination, postmortems, and continuous improvement of response workflows
- Design and implement automation to eliminate high-risk, low-frequency operational tasks and reduce system fragility
- Take ownership of key infrastructure domains such as Kubernetes operations, observability systems, or workflow orchestration platforms
- Manage vendor relationships and external integrations, ensuring reliability, accountability, and reduced operational dependency
- Drive FinOps initiatives by improving cost visibility, optimizing cloud usage, and aligning infrastructure spend with business needs
- Collaborate with engineering teams to define reliability standards, operational best practices, and scalable system design patterns
- Build documentation and operational frameworks that eliminate single points of failure across critical systems
The ideal candidate brings strong hands-on experience in production infrastructure environments, with a focus on reliability engineering, automation, and cloud-native systems. You are comfortable operating in high-scale AWS and Kubernetes environments and have a pragmatic approach to solving operational challenges.
- 5+ years of experience in Site Reliability Engineering, DevOps, or infrastructure-focused engineering roles in production environments
- Strong experience with AWS and Kubernetes in large-scale systems, with additional familiarity with MongoDB and distributed systems
- Proficiency in Python or similar scripting languages for automation and operational tooling
- Deep understanding of incident management, root cause analysis, and production reliability practices
- Strong judgment under pressure, with the ability to remain calm and effective during critical incidents
- Experience working cross-functionally across engineering, infrastructure, and external vendor teams
- Strong communication skills with the ability to influence through data, clarity, and collaboration rather than escalation
- Bonus: exposure to FinOps, observability platforms, Temporal, or vendor management in infrastructure environments
Benefits
- Competitive base salary with performance-based compensation components
- Equity participation in a high-growth technology company
- Comprehensive medical, dental, and vision coverage for employees and eligible dependents
- Flexible and remote-first working environment
- Paid time off, parental leave, and company holidays
- Learning and development budget to support continuous skill growth
- Modern cloud infrastructure environment with opportunities to work on large-scale distributed systems
- Exposure to cutting-edge AI infrastructure and enterprise-grade production systems