Principal ML Engineer, Machine Learning Platform and Systems Architecture in United States at Jobgether
Job Description
This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Principal ML Engineer, Machine Learning Platform and Systems Architecture, in the United States.
This role is a senior technical leadership position focused on designing and scaling the foundational machine learning systems that power large-scale, production-grade AI applications. You will define and evolve the architecture of ML platforms spanning training, deployment, observability, and data infrastructure, ensuring they are robust, scalable, and efficient. The position sits at the intersection of distributed systems engineering, machine learning infrastructure, and platform strategy, with direct influence on how AI capabilities are delivered into production. You will collaborate closely with researchers, engineers, and product leaders to translate advanced ML concepts into reliable system-level solutions. This is a highly impactful role where you will shape technical direction, solve ambiguous cross-functional challenges, and drive platform excellence across the organization. The environment is remote-friendly, highly collaborative, and focused on building systems that enable cutting-edge innovation at scale.
In this role, you will be responsible for leading the design, development, and evolution of large-scale ML platform and systems architecture supporting end-to-end machine learning workflows.
- Lead architecture and delivery of core ML platform capabilities including training, deployment, evaluation, and observability systems
- Design scalable distributed systems for data processing, feature engineering, model lifecycle management, and production inference
- Own end-to-end technical outcomes for platform initiatives, from architecture design through deployment and operational support
- Develop and scale large data pipelines for structured and semi-structured datasets across distributed environments
- Define and implement frameworks for model deployment, monitoring, observability, and system reliability
- Establish data governance, lineage, and responsible data usage practices across ML infrastructure
- Drive architecture for distributed processing systems using tools such as Ray, Spark, Airflow, or equivalent technologies
- Lead incident response for critical platform issues and implement long-term system improvements
- Mentor engineers, provide technical leadership, and establish best practices for ML system design and operations
- Communicate technical strategies, tradeoffs, and architecture decisions to both technical and non-technical stakeholders
The ideal candidate brings deep expertise in distributed systems, ML infrastructure, and large-scale platform engineering, along with strong technical leadership skills.
- 6–8+ years of experience in software engineering, ML infrastructure, platform engineering, or distributed systems
- Bachelor’s or Master’s degree in Computer Science or Engineering, or equivalent practical experience
- Strong expertise in designing and operating large-scale distributed systems and data platforms
- Advanced proficiency in Python and strong production software engineering practices
- Experience leading complex, cross-functional technical initiatives across multiple engineering teams
- Strong background in ML infrastructure including model deployment, inference systems, and observability frameworks
- Experience with large-scale data pipelines, cloud-native architectures, and distributed processing frameworks
- Ability to make architectural decisions balancing scalability, performance, reliability, and cost
- Strong communication and stakeholder management skills across technical and leadership audiences
- Preferred: experience with Kubernetes, ML orchestration tools, data lineage systems, and ML-ready data representations (graph, geometry, multimodal)
The role offers the following compensation and benefits.
- Competitive base salary ranging from $152,000 to $272,250 depending on experience and location
- Annual cash bonus eligibility, plus stock grants and additional incentive compensation (role dependent)
- Comprehensive health, dental, and vision insurance coverage
- Retirement and financial wellness programs
- Flexible remote work options across the United States and Canada
- Paid time off and wellness-focused benefits supporting work-life balance
- Strong learning and development support for continuous technical growth
- Inclusive, innovation-driven culture focused on collaboration and belonging
- Opportunity to build foundational ML systems powering advanced real-world applications