Senior Machine Learning Systems Engineer, Ads ML Experience Platform in United States at Jobgether
Job Description
This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Senior Machine Learning Systems Engineer, Ads ML Experience Platform, based in the United States.
This role sits at the core of a large-scale machine learning ecosystem powering Ads ML development and experimentation. You will design and build next-generation infrastructure that accelerates the full ML lifecycle, from offline experimentation to production training, evaluation, and deployment. The environment is highly technical, fast-paced, and deeply collaborative, and you will work closely with ML engineers, researchers, and platform teams. You will contribute to systems that enable reproducible research, scalable model iteration, and automated ML workflows. A key focus is advancing developer experience through robust tooling and intelligent automation. The role also explores emerging agentic AI systems that support autonomous and human-in-the-loop workflows. Your work will directly affect the speed, reliability, and scalability of ML innovation on a global platform such as Reddit.
In this role, you will lead the design and development of scalable ML infrastructure that powers experimentation, training, and deployment workflows across Ads ML systems.
- Build and evolve large-scale offline ML experimentation platforms enabling reproducibility, evaluation, and model promotion workflows.
- Develop distributed training orchestration systems supporting hyperparameter tuning, retraining, and evaluation pipelines.
- Design infrastructure for experiment tracking, metadata management, lineage, artifact versioning, and model registries.
- Create automated workflows for model promotion, rollback, compliance validation, and continuous monitoring.
- Collaborate with ML engineers and researchers to improve experimentation velocity and platform efficiency.
- Contribute to the design of agentic AI systems enabling multi-agent orchestration and intelligent workflow execution.
- Ensure systems are reliable, scalable, and optimized for high-performance ML development at production scale.
This role requires strong expertise in large-scale distributed systems and hands-on experience building production-grade ML platforms and infrastructure.
- 5+ years in platform engineering, distributed systems, or large-scale infrastructure development.
- 2+ years building production ML infrastructure, developer platforms, or AI tooling.
- Strong experience with ML workflow orchestration and distributed data processing frameworks (e.g., Spark, Ray, Flink).
- Hands-on experience with orchestration tools such as Airflow, Kubeflow, Argo, or equivalent systems.
- Proven ability to build and maintain ML experimentation platforms, model registries, or training pipelines.
- Strong programming skills in Python and familiarity with scalable software engineering practices.
- Experience with cloud-based ML systems and production deployment environments.
- Exposure to agentic AI systems, multi-agent workflows, or autonomous orchestration frameworks is a strong plus.
- Excellent communication skills with the ability to translate technical complexity into clear insights for diverse stakeholders.
Benefits
- Competitive base salary with additional equity (RSUs) and potential bonus eligibility
- Comprehensive medical, dental, and vision insurance coverage
- 401(k) retirement plan with employer matching
- Generous paid time off, including vacation, holidays, and parental leave
- Equity participation in a high-growth, impact-driven engineering environment
- Flexible work arrangements with remote eligibility across supported regions
- Professional development opportunities in advanced ML systems and AI infrastructure
- Inclusive, collaborative engineering culture focused on innovation and impact