Can I apply directly for this job on this page?

Yes, you can begin your application on this page using a quick form. You'll then be redirected to the employer's career site to complete the full application process.

What is the role of a Staff Machine Learning Systems Engineer (MLOps) at Jobgether?

The Staff Machine Learning Systems Engineer (MLOps) position at Jobgether is a full-time or part-time role in the Engineering field.

Where is this Staff Machine Learning Systems Engineer (MLOps) job located?

United States, Other / Non-US

What type of employment is offered for this Staff Machine Learning Systems Engineer (MLOps) role?

Full-time or part-time position

What is the expected salary for this Staff Machine Learning Systems Engineer (MLOps) job?

Compensation will be discussed during the hiring process.

Staff Machine Learning Systems Engineer (MLOps)

This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Staff Machine Learning Systems Engineer (MLOps) based in the United States.

This is a high-impact infrastructure role focused on building and operating the production systems that power large-scale AI and ML services. You will define how machine learning workloads are deployed, observed, secured, and scaled across cloud-native environments. The role sits at the intersection of platform engineering, DevOps, and applied AI, ensuring that every AI product can be shipped safely and reliably. You will design the underlying Kubernetes-based infrastructure, CI/CD pipelines, and model-serving systems that support mission-critical workloads. Working closely with ML engineers, product teams, and security stakeholders, you will help translate experimental AI capabilities into production-grade systems. This is a hands-on senior technical role for someone who thrives in complex, high-scale, and fast-evolving environments.

Accountabilities:

Lead the design, evolution, and operation of the core ML infrastructure platform supporting AI workloads across production systems, ensuring scalability, reliability, and security across environments.

Own and optimize Kubernetes-based infrastructure (e.g., EKS), including autoscaling, workload orchestration, and cluster lifecycle management for ML and AI systems
Build and maintain GitOps-based CI/CD pipelines enabling safe, repeatable, and efficient deployment of AI services across environments
Design and implement model serving and inference infrastructure, including LLM routing, API gateways, and multi-provider integrations
Develop observability, tracing, and monitoring systems for AI workloads using tools such as OpenTelemetry, Datadog, and LLM tracing platforms
Define and enforce SLOs, incident response processes, and reliability standards for ML systems in production
Own infrastructure-as-code and platform tooling (Terraform, CLIs, internal frameworks) to improve developer velocity and consistency
Drive security, IAM, and secrets management architecture ensuring compliance, least-privilege access, and data protection standards
Collaborate with ML, product, and data teams to translate research and prototypes into production-ready systems
Identify platform bottlenecks and lead initiatives to improve performance, cost efficiency, and deployment speed
Provide technical leadership, mentorship, and architectural guidance across ML systems engineering initiatives

Requirements:

This role requires deep expertise in cloud infrastructure, ML systems, and production-grade platform engineering, with a strong focus on reliability, scalability, and security.

8+ years of experience in platform engineering, DevOps, SRE, or infrastructure roles, including hands-on ML/AI systems experience
Strong expertise with Kubernetes (preferably EKS), including cluster operations, autoscaling, and workload orchestration
Proficiency in infrastructure-as-code tools such as Terraform and experience designing secure cloud architectures
Solid programming skills in Python with experience building infrastructure tooling and automation systems
Experience operating LLM or ML inference systems in production, including routing, serving, and observability
Hands-on experience with observability stacks (Datadog, OpenTelemetry, logging/tracing systems, or equivalents)
Strong understanding of CI/CD systems, GitOps workflows, and developer platform engineering
Experience designing IAM, OIDC, and secrets management systems in cloud environments
Systems-thinking mindset with strong attention to failure modes, reliability, and long-term maintainability
Ability to collaborate across engineering, ML, security, and product teams in fast-paced environments
Experience in regulated or high-compliance environments (healthcare, fintech, or similar) is a plus

Benefits:

Competitive salary with equity opportunities
Comprehensive health coverage including medical, dental, and vision
Unlimited PTO, company holidays, and mental health days
Parental leave and family support benefits
401(k) with employer matching
Employee stock purchase program (ESPP)
Remote-first flexibility and offsite team gatherings
Strong emphasis on wellness, learning, and professional development

How Jobgether works:

We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.

We appreciate your interest and wish you the best!

Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.
