Staff Machine Learning Systems Engineer (MLOps) in United States at Jobgether

Job Function: Engineering
Jobgether
United States, United States

Job Description

Staff Machine Learning Systems Engineer (MLOps)

This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Staff Machine Learning Systems Engineer (MLOps) based in the United States.

This is a high-impact infrastructure role focused on building and operating the production systems that power large-scale AI and ML services. You will define how machine learning workloads are deployed, observed, secured, and scaled across cloud-native environments. The role sits at the intersection of platform engineering, DevOps, and applied AI, ensuring that every AI product can be shipped safely and reliably. You will design the underlying Kubernetes-based infrastructure, CI/CD pipelines, and model-serving systems that support mission-critical workloads. Working closely with ML engineers, product teams, and security stakeholders, you will help translate experimental AI capabilities into production-grade systems. This is a hands-on senior technical role for someone who thrives in complex, high-scale, and fast-evolving environments.

Accountabilities:

Lead the design, evolution, and operation of the core ML infrastructure platform supporting AI workloads across production systems, ensuring scalability, reliability, and security across environments.

  • Own and optimize Kubernetes-based infrastructure (e.g., EKS), including autoscaling, workload orchestration, and cluster lifecycle management for ML and AI systems
  • Build and maintain GitOps-based CI/CD pipelines enabling safe, repeatable, and efficient deployment of AI services across environments
  • Design and implement model serving and inference infrastructure, including LLM routing, API gateways, and multi-provider integrations
  • Develop observability, tracing, and monitoring systems for AI workloads using tools such as OpenTelemetry, Datadog, and LLM tracing platforms
  • Define and enforce SLOs, incident response processes, and reliability standards for ML systems in production
  • Own infrastructure-as-code and platform tooling (Terraform, CLIs, internal frameworks) to improve developer velocity and consistency
  • Drive security, IAM, and secrets management architecture ensuring compliance, least-privilege access, and data protection standards
  • Collaborate with ML, product, and data teams to translate research and prototypes into production-ready systems
  • Identify platform bottlenecks and lead initiatives to improve performance, cost efficiency, and deployment speed
  • Provide technical leadership, mentorship, and architectural guidance across ML systems engineering initiatives

Requirements:

This role requires deep expertise in cloud infrastructure, ML systems, and production-grade platform engineering, with a strong focus on reliability, scalability, and security.

  • 8+ years of experience in platform engineering, DevOps, SRE, or infrastructure roles, including hands-on ML/AI systems experience
  • Strong expertise with Kubernetes (preferably EKS), including cluster operations, autoscaling, and workload orchestration
  • Proficiency in infrastructure-as-code tools such as Terraform and experience designing secure cloud architectures
  • Solid programming skills in Python with experience building infrastructure tooling and automation systems
  • Experience operating LLM or ML inference systems in production, including routing, serving, and observability
  • Hands-on experience with observability stacks (Datadog, OpenTelemetry, logging/tracing systems, or equivalents)
  • Strong understanding of CI/CD systems, GitOps workflows, and developer platform engineering
  • Experience designing IAM, OIDC, and secrets management systems in cloud environments
  • Systems-thinking mindset with strong attention to failure modes, reliability, and long-term maintainability
  • Ability to collaborate across engineering, ML, security, and product teams in fast-paced environments
  • Experience in regulated or high-compliance environments (healthcare, fintech, or similar) is a plus

Benefits:
  • Competitive salary with equity opportunities
  • Comprehensive health coverage including medical, dental, and vision
  • Unlimited PTO, company holidays, and mental health days
  • Parental leave and family support benefits
  • 401(k) with employer matching
  • Employee stock purchase program (ESPP)
  • Remote-first flexibility and offsite team gatherings
  • Strong emphasis on wellness, learning, and professional development

How Jobgether works:
We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.
We appreciate your interest and wish you the best!
Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.

Job Location

United States, United States
