Human Data Evals Lead in the United States at Jobgether
Job Description
This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Human Data Evals Lead based in the United States.
This role sits at the core of frontier AI data operations, owning how high-quality evaluation datasets and benchmarks are designed, validated, and delivered to leading AI labs. You will be responsible for translating ambiguous evaluation needs into structured, high-signal data proposals and production-ready sample packages that demonstrate model performance with rigor and clarity. The work blends technical judgment, quality design, and commercial awareness, requiring close collaboration with subject-matter experts and research stakeholders. You will shape how “frontier-grade” quality is defined and enforced, ensuring every dataset meets the standards expected by advanced model developers. Acting as a key interface with AI lab partners, you will help convert pilots into scaled production engagements. This is a high-ownership role at the intersection of AI evaluation, data quality, and applied research operations.
Own the design, development, and delivery of high-quality AI evaluation data initiatives, from initial proposals through pilot execution and production readiness.
- Develop data proposals and sample packages based on lab requests, benchmarks, and evaluation targets, translating them into structured, high-signal datasets.
- Design frontier-grade evaluation samples across reasoning, coding, agents, tool use, and multimodal tasks, ensuring measurable model discrimination and headroom.
- Define and enforce rigorous quality control frameworks, including expert verification, calibration layers, rubrics, and deterministic validation approaches.
- Recruit, onboard, and manage subject-matter experts across technical domains, ensuring consistent output quality aligned with benchmark standards.
- Own pilot engagements end-to-end, including scoping, staffing, SOW definition, QC execution, and final delivery to AI lab partners.
- Act as a key point of contact for lab stakeholders, aligning expectations and surfacing technical requirements in collaboration with internal leadership.
- Continuously refine evaluation methodologies and sample design standards to improve signal quality and benchmark reliability.
You are an experienced operator in AI evaluation or technical delivery, with strong expertise in building structured, high-quality data systems for model assessment.
- 5+ years of experience in technical program management, data operations, quality engineering, or ML evaluation roles.
- Proven experience working with AI labs or enterprise ML teams, delivering datasets, benchmarks, or evaluation frameworks.
- Strong understanding of LLM evaluation concepts such as benchmarks, rubrics, pass rates, headroom, and model discrimination.
- Hands-on experience designing or managing QC processes and ensuring high-quality annotated or evaluated datasets.
- Demonstrated ability to recruit, manage, and calibrate subject-matter experts or external contributor pools.
- Strong problem-solving skills in ambiguous environments with evolving requirements and fast iteration cycles.
- Excellent English communication skills; Spanish is a plus.
- Competitive compensation aligned with senior-level AI and data roles
- Remote-first setup with flexibility across LATAM and US time zones
- Opportunity to work directly with leading AI labs and frontier model development teams
- High-ownership role with significant influence over evaluation standards and methodologies
- Collaboration with top-tier subject-matter experts across technical domains
- Exposure to cutting-edge AI benchmarking and evaluation practices
- Fast-paced, research-driven environment with strong learning potential
- Opportunity to shape how frontier model quality is measured and improved