ML Infrastructure Engineer in Germany at Jobgether
Job Description
This position is posted by Jobgether on behalf of a partner company. We are currently looking for an ML Infrastructure Engineer in Germany.
Join a cutting-edge AI infrastructure team focused on powering the next generation of machine learning and large-scale AI workloads. This role sits at the intersection of GPU performance engineering, deep learning optimization, and cloud-scale infrastructure development: you will benchmark and optimize advanced GPU platforms that support training and inference for complex neural networks and AI systems.

Working alongside highly skilled engineering and hardware teams, you will help drive performance improvements across compute architectures, software stacks, and distributed AI environments. The position is ideal for engineers passionate about ML systems, large-scale model performance, and infrastructure innovation. With exposure to modern AI frameworks, high-performance GPU ecosystems, and international collaboration, it offers a strong platform for technical growth and meaningful impact within the AI industry.

Responsibilities:
- Benchmark and evaluate GPU platform performance for machine learning and AI workloads across various architectures, frameworks, and software environments.
- Collaborate closely with hardware and engineering teams to profile GPU performance at system and kernel levels and identify optimization opportunities.
- Analyze, debug, and optimize training and inference workloads to improve efficiency, scalability, and overall hardware utilization.
- Conduct acceptance testing for new GPU clusters to validate performance, stability, compatibility, and operational readiness for AI workloads.
- Perform experiments across multiple GPU configurations and interconnect strategies to assess system-level scalability and performance trade-offs.
- Develop internal tools, dashboards, and reporting frameworks to visualize performance metrics, bottlenecks, and infrastructure trends.
- Contribute to infrastructure best practices, internal tooling enhancements, and benchmarking methodologies for AI and ML environments.
- Support ongoing platform optimization efforts related to distributed training, inference acceleration, parallelism strategies, and hardware-aware performance tuning.
Requirements:
- Strong theoretical foundation in machine learning, deep learning architectures, and AI system optimization principles.
- Deep understanding of performance optimization techniques for large neural network training and inference, including parallelism strategies, kernel optimization, batching, and hardware acceleration.
- Extensive experience with modern deep learning frameworks and libraries such as PyTorch, JAX, Megatron-LM, TensorRT-LLM, or equivalent technologies.
- Solid expertise with GPU technologies and software stacks including CUDA, NCCL, GPU drivers, and performance-related libraries.
- Experience profiling and debugging GPU workloads using tools such as Nsight, nvprof, perf, or similar performance analysis platforms.
- Familiarity with containerized and distributed environments including Docker and Kubernetes.
- Strong programming and scripting skills, particularly in Python, along with experience in performance-oriented development workflows.
- Excellent problem-solving, analytical thinking, and communication skills with the ability to work independently in highly technical environments.
- Experience with LLM inference frameworks such as vLLM, SGLang, or TensorRT-LLM is considered a strong advantage.
- Familiarity with cloud-based ML ecosystems such as AWS, Google Cloud Platform, or Azure ML is beneficial.
- Contributions to open-source ML tooling, benchmarking frameworks, or infrastructure projects are highly valued.
Benefits:
- Competitive compensation package aligned with experience and technical expertise.
- Flexible remote work environment supporting strong work-life balance.
- Access to continuous learning, career development, and growth opportunities within the AI infrastructure space.
- Opportunity to work on impactful AI projects shaping the future of machine learning infrastructure and cloud computing.
- Collaborative and innovation-driven engineering culture with strong technical ownership and autonomy.
- International work environment with exposure to globally distributed teams and advanced AI technologies.
- Fast-paced setting focused on bold thinking, experimentation, and continuous technical evolution.
- Opportunity to contribute to high-performance AI systems used by developers and enterprises worldwide.