Principal MLOps Platform Engineer in United States at Jobgether
Job Description
This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Principal MLOps Platform Engineer in the United States.
This role sits at the center of building and operating a next-generation AI and MLOps platform designed to support production-grade machine learning and agentic systems at scale. You will design the infrastructure backbone that enables model deployment, observability, orchestration, and cost-efficient runtime operations across cloud environments. The position combines deep cloud engineering, platform architecture, and MLOps expertise, with an emphasis on reliability and automation.

You will define how models and LLM-powered services are deployed, monitored, and governed in production. Working across engineering, data, and AI teams, you will ensure seamless integration of ML workflows into scalable, secure, and observable systems. This is a high-impact role in which your work directly shapes platform performance, developer experience, and operational efficiency. You will also help establish best practices for cost control, environment management, and production readiness of AI systems.
In this role, you will be responsible for designing, building, and operating the core MLOps platform infrastructure that supports deployment, observability, and lifecycle management of AI and ML systems.
- Build and maintain infrastructure as code using Terraform or AWS CDK to support scalable ML platform environments
- Design and implement CI/CD pipelines using tools such as GitHub Actions, GitLab CI, or AWS CodePipeline
- Establish observability frameworks for ML and LLM systems using CloudWatch, OpenTelemetry, and related tools
- Manage containerized workloads using Docker and orchestration platforms such as ECS Fargate or EKS
- Define and enforce environment isolation strategies, model versioning, and prompt lifecycle management
- Implement monitoring and cost governance mechanisms, including budgets and usage tracking via CloudWatch
- Ensure reliability, scalability, and performance of ML runtime infrastructure across production environments
- Collaborate with AI, data, and engineering teams to integrate ML workflows into platform architecture
- Continuously improve automation, deployment efficiency, and platform developer experience
- Support best practices for secure, compliant, and cost-effective ML operations
The ideal candidate is a highly skilled cloud and platform engineer with substantial MLOps experience, deep AWS expertise, and a strong focus on reliability, observability, and scalable infrastructure design.
- 7+ years of experience in platform engineering, DevOps, MLOps, or cloud infrastructure roles
- Deep expertise in AWS, including production-grade architecture and operational management
- Strong experience building infrastructure as code using Terraform or AWS CDK
- Hands-on experience with CI/CD pipelines and modern deployment workflows
- Proven experience with containerization and orchestration (Docker, ECS, EKS, or Kubernetes)
- Strong understanding of observability practices using tools such as CloudWatch and OpenTelemetry
- Experience managing ML or LLM workloads in production environments is highly desirable
- Strong focus on reliability, scalability, security, and cost optimization
- Experience with environment isolation, versioning, and model lifecycle management
- Strong analytical and problem-solving skills in complex distributed systems
- AWS certifications (Solutions Architect Associate or Professional) are preferred
- Kubernetes or CNCF certifications are a plus
- Bachelor’s degree in Computer Science, Information Systems, or related field preferred
This position offers a competitive compensation package, a comprehensive benefits program, and the opportunity to work on cutting-edge AI infrastructure.
- Salary range: $170,000 – $190,000 annually (OTE, including base and bonus where applicable)
- Comprehensive medical, dental, and vision insurance
- 401(k) retirement savings plan
- Paid time off and company holidays
- Paid parental and caregiver leave
- Remote-friendly work environment (where applicable)
- Access to advanced technology environments and internal engineering labs
- Continuous learning support, including certifications and training opportunities
- Strong culture of inclusion, collaboration, and innovation
- Opportunity to build and scale production AI/ML platforms at enterprise level