Senior Site Reliability Engineer in United States at Jobgether
Job Description
This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior Site Reliability Engineer in the United States.
This is an exciting opportunity for a highly skilled Site Reliability Engineer to help build and scale the reliability foundation of a cutting-edge AI-driven platform. In this role, you will lead strategic reliability initiatives across complex cloud infrastructure, AI workloads, and developer enablement systems. You will work at the intersection of platform engineering, observability, automation, and AI operations, helping teams deliver resilient and scalable services with confidence. The environment is fast-paced, collaborative, and innovation-focused, offering strong technical ownership and leadership influence. Ideal candidates are passionate about cloud-native infrastructure, operational excellence, and enabling high-performing engineering teams. This role is fully remote within the United States and offers the chance to shape reliability practices for next-generation AI-powered systems.
Responsibilities
- Own and drive platform reliability initiatives, including defining and managing SLIs, SLOs, and error budgets across production services and AI-driven workloads.
- Design and implement resilient infrastructure patterns for AI pipelines, including observability, failure detection, graceful degradation, and workload isolation.
- Lead incident response processes, disaster recovery planning, and post-incident reviews focused on long-term operational improvements.
- Partner closely with Software Engineering and AI Engineering teams to establish reliability standards, deployment best practices, and scalable CI/CD workflows.
- Develop and maintain observability solutions using monitoring, tracing, logging, and telemetry tools to ensure visibility across services and AI operations.
- Manage infrastructure as code, cloud cost optimization initiatives, and automation strategies to improve operational efficiency and scalability.
- Build and enhance Internal Developer Platforms (IDPs), service catalogs, and self-service tooling that empower engineering teams.
- Mentor junior and intermediate engineers, contributing to technical growth, knowledge sharing, and engineering excellence across the organization.
Requirements
- Bachelor’s degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience.
- 6–8 years of experience in Site Reliability Engineering, Platform Engineering, or DevOps with demonstrated technical leadership responsibilities.
- Deep expertise with AWS services, Kubernetes, Docker, Terraform, GitOps methodologies, and cloud-native infrastructure patterns.
- Strong experience with observability platforms, distributed tracing, monitoring systems, and operational tooling.
- Proficiency in Python and/or Bash scripting, along with experience supporting microservices architectures and CI/CD pipelines.
- Familiarity with Internal Developer Platform tools such as Backstage or similar solutions is highly desirable.
- Experience supporting AI/ML infrastructure, LLM integrations, or agentic systems is considered a major asset.
- Excellent analytical, communication, mentoring, and problem-solving skills, with the ability to navigate complex technical environments.
- Experience with FinOps, disaster recovery planning, policy-as-code, or regulated environments is a plus.
Benefits
- Competitive salary range of approximately $149,100 – $157,800 USD
- Comprehensive medical, dental, and vision coverage
- 401(k) matching program
- Flexible vacation policy
- Company-sponsored training and professional development opportunities
- Annual wellness and fitness reimbursement programs
- Inclusive and collaborative remote work environment
- Opportunities for community involvement and charitable engagement
- Access to wellness resources and employee support initiatives
- Occasional travel opportunities for collaboration and team engagement