Can I apply directly for this job on this page?

Yes, you can begin your application on this page using a quick form. You'll then be redirected to the employer's career site to complete the full application process.

What is the role of a Site Reliability Engineer - AI Agents at Jobgether?

The Site Reliability Engineer - AI Agents position at Jobgether is a full-time or part-time opportunity in the Engineering field.

Where is this Site Reliability Engineer - AI Agents job located?

United States, Other / Non-US

What type of employment is offered for this Site Reliability Engineer - AI Agents role?

Full-time or part-time position

What is the expected salary for this Site Reliability Engineer - AI Agents job?

Compensation will be discussed during the hiring process.

Site Reliability Engineer - AI Agents

This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Site Reliability Engineer – AI Agents based in the United States.

This is a highly technical platform engineering role focused on building and operating the infrastructure that powers production-grade AI agent systems at scale. You will work at the intersection of SRE, MLOps, and platform engineering, ensuring that agentic workflows are reliable, observable, and performant across both internal tools and external-facing products. The role involves designing and maintaining cloud-native infrastructure, enabling seamless orchestration and execution of AI workloads in production environments. You will also contribute to developer platform capabilities, building APIs, SDKs, and self-service tools that allow engineering and AI teams to efficiently consume infrastructure services. The environment is fast-paced and innovation-driven, requiring strong ownership, operational discipline, and comfort working with rapidly evolving AI technologies. This position offers the opportunity to shape foundational systems powering next-generation AI agent infrastructure.

Accountabilities:

You will be responsible for designing, operating, and scaling resilient infrastructure systems that support AI agent workloads in production, ensuring reliability, scalability, and developer usability across the platform.

Design, build, and operate cloud-native infrastructure supporting AI agent execution, orchestration, and model serving at scale
Ensure reliability, observability, and performance of distributed agentic systems across internal and external-facing products
Develop platform services, APIs, SDKs, and self-service tooling to enable teams to efficiently consume AI infrastructure capabilities
Manage and optimize compute, orchestration, and serving layers for AI and ML workloads in production environments
Build and maintain CI/CD pipelines to enable safe, fast, and reliable deployment of AI services and agent workflows
Implement Infrastructure as Code using tools such as Terraform to provision and manage AWS-based infrastructure
Design monitoring, alerting, and observability systems tailored to AI/ML and agent-based workloads
Define and enforce reliability patterns, guardrails, and failure recovery mechanisms for LLM and agentic systems
Collaborate with AI, Data Engineering, and Product teams to transform experimental prototypes into production-ready systems
Manage Kubernetes-based container orchestration environments, ensuring scalable and efficient workload deployment
Implement security best practices and access controls across infrastructure and platform services
Document system architecture, operational procedures, and runbooks to support team knowledge sharing and reliability

Requirements

The ideal candidate is a strong platform-minded engineer with deep SRE experience, a solid understanding of cloud-native systems, and exposure to AI/ML infrastructure or agent-based systems.

5+ years of experience in Site Reliability Engineering, Platform Engineering, Infrastructure Engineering, or similar production-focused roles
Hands-on experience supporting ML systems, model serving infrastructure, or MLOps pipelines in production environments
Strong experience building developer platforms, internal tools, APIs, or SDKs used by engineering teams at scale
Deep understanding of platform engineering principles, including self-service infrastructure and developer experience design
Strong proficiency with Infrastructure as Code tools, particularly Terraform
Advanced experience with Kubernetes and containerized environments (Docker)
Solid cloud infrastructure experience, preferably within AWS environments
Strong programming and scripting skills (Python preferred, plus bash/shell proficiency)
Experience designing and operating observability, logging, monitoring, and alerting systems
Proven experience with incident response, on-call rotations, and production reliability ownership
Strong cross-functional collaboration skills across AI, data, and engineering teams
High ownership mindset with the ability to operate in fast-moving, high-stakes production environments
Familiarity with AI/agent systems, orchestration frameworks, or LLM-based applications is a strong plus

Benefits

Competitive compensation package with performance-based incentives
Remote-first working model across multiple eligible countries
Comprehensive medical, dental, and vision insurance coverage (where applicable)
Retirement savings plans with employer contribution options
Flexible PTO policy and company holidays
Mental health support and wellness programs
Learning and development budget for technical and professional growth
Opportunities to work on cutting-edge AI agent infrastructure at global scale
Inclusive, distributed engineering culture with strong emphasis on ownership and impact
Regular opportunities to collaborate with high-performing AI and platform engineering teams

How Jobgether works:

We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.

We appreciate your interest and wish you the best!

Why Apply Through Jobgether?

Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.
