Technical Product Manager - AI Compute Platform in Germany at Jobgether
Job Description
This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Technical Product Manager - AI Compute Platform based in Germany.
In this role, you will help shape and scale a next-generation AI cloud platform powering some of the most demanding machine learning workloads in the world. You will own critical parts of a hyperscale infrastructure product, working at the intersection of engineering, customers, and platform strategy. This is a deeply technical product management role where you will collaborate as a peer with senior engineers on topics such as GPU orchestration, distributed training, cluster operations, and cloud APIs. You will translate complex customer needs into robust platform capabilities that enable large-scale AI training and inference. The environment is fast-paced, highly technical, and mission-driven, with direct impact on how frontier AI systems are built and deployed. This role is ideal for someone who thrives in ambiguity, enjoys solving infrastructure-scale problems, and wants to define the future of AI compute platforms.
- Own end-to-end product strategy, roadmap, and execution for a critical slice of an AI compute platform, ensuring alignment with customer and business outcomes.
- Define and evolve platform contracts such as APIs, system behaviors, lifecycle semantics, and developer-facing interfaces at hyperscaler quality.
- Lead cross-functional execution across engineering, SRE, networking, storage, observability, IAM, billing, capacity planning, and customer-facing teams.
- Drive structured product discovery through customer interviews, usage analytics, incident analysis, and support feedback loops.
- Translate complex technical and operational challenges into clear product requirements and measurable success metrics.
- Collaborate as a technical peer with engineering teams to evaluate architecture decisions, system trade-offs, and platform design quality.
- Own adoption and performance of shipped features, ensuring continuous improvement based on real-world usage and telemetry.
- Serve as the escalation point for customer-facing teams on product behavior, system reliability, and platform design decisions.
- Define success metrics tied to customer impact, platform efficiency, and operational excellence rather than output-based delivery.
Requirements
- 6+ years of experience in Product Management, Platform/Product Infrastructure roles, or equivalent experience in SRE or Engineering leadership with strong product ownership.
- Strong technical foundation in cloud infrastructure, distributed systems, or AI/ML platforms, with the ability to reason about system design and architecture.
- Experience working with or operating large-scale infrastructure such as GPU clusters, HPC systems, or multi-tenant cloud environments.
- Proven track record of shipping complex technical products with measurable impact on customers or platform performance.
- Strong analytical skills with experience defining metrics, working with telemetry, and driving data-informed product decisions.
- Experience leading discovery processes, including customer interviews, usage analysis, and support-driven insights.
- Ability to engage confidently with engineering teams on topics such as API design, system reliability, control planes, and distributed systems behavior.
- Excellent communication and stakeholder management skills across engineering, product, operations, and executive teams.
- High ownership mindset with a strong bias toward execution, iteration, and operational excellence.
Nice to Have
- Familiarity with GPU infrastructure, Kubernetes, Slurm, or HPC environments is highly desirable.
- Experience with distributed ML training or inference workloads (e.g., multi-node training, NCCL, checkpointing, fault-tolerant systems) is a strong plus.
- Exposure to cloud platforms at hyperscaler scale (AWS, GCP, Azure) and developer experience design (APIs, CLI tools, observability systems) is advantageous.
- Understanding of reliability engineering practices, SRE principles, and operational metrics such as MTTR, MTBF, or system-level goodput is a plus.
- Experience working with emerging AI workloads such as agentic systems, RL pipelines, or large-scale inference serving is considered a bonus.
Benefits
- Competitive compensation package.
- Opportunity to shape foundational infrastructure for the global AI ecosystem.
- High-impact role with ownership over critical components of a hyperscale AI platform.
- Flexible, trust-based work environment with strong autonomy.
- Career growth in a fast-scaling, engineering-driven organization.
- Exposure to cutting-edge AI, GPU, and distributed systems technologies.
- Collaborative international environment with world-class engineering teams.
- Opportunity to work on problems at the frontier of cloud infrastructure and AI compute.