Senior Networking Solution Test Engineer – AI Cluster Debugging in Switzerland at Jobgether
Job Description
This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior Networking Solution Test Engineer – AI Cluster Debugging in Switzerland.
This role sits at the forefront of large-scale AI infrastructure validation, where networking, systems engineering, and artificial intelligence workloads converge. You will be responsible for ensuring the reliability and performance of complex AI clusters built on high-speed interconnect technologies such as NVLink, Ethernet, and InfiniBand. Working in a highly technical and collaborative environment, you will investigate deep system-level issues spanning hardware, drivers, networking stacks, and AI frameworks. The position requires strong debugging intuition and the ability to reproduce and analyze real-world customer scenarios in advanced test environments. You will contribute directly to the stability and scalability of next-generation AI training and inference systems used at massive scale. This is a hands-on engineering role where your analysis and findings directly shape product quality and system performance.
Responsibilities
- Design and review test strategies and product requirements for NVLink, Ethernet, and InfiniBand-based AI cluster systems.
- Build and maintain realistic, large-scale test environments replicating customer-like AI infrastructure, including heterogeneous hardware and software stacks.
- Lead end-to-end system debugging across hardware, firmware, networking, and AI software layers to identify and resolve root causes.
- Analyze logs, inspect source code, and validate fixes across components such as NICs, DPUs, switches, and AI communication libraries.
- Collaborate closely with development teams to debug and optimize communication libraries and protocols such as NCCL, RoCE, and RDMA.
- Define, design, and guide automation efforts for robust testing frameworks producing actionable logs, metrics, and traces.
- Execute regression, performance, functional, and scalability testing, and deliver clear, data-driven technical reports.
- Profile and benchmark AI training and inference workloads, correlating application behavior with system and network performance metrics.
Requirements
- Bachelor’s or Master’s degree in Computer Science, Electrical Engineering, or equivalent hands-on experience in systems/network engineering.
- 8+ years of experience in Linux-based networking, system testing, and complex debugging environments.
- Strong expertise in Linux networking tools and debugging utilities (e.g., tcpdump, ethtool, iproute2, perf).
- Proven experience in production-grade troubleshooting, hypothesis-driven debugging, and root cause analysis under pressure.
- Solid understanding of NIC architecture, offloads, queue management, and driver/firmware interactions.
- Deep knowledge of AI networking technologies such as NCCL, RoCE, and RDMA.
- Ability to read, understand, and debug source code in C/C++, Python, or similar languages.
- Strong scripting and automation skills using Bash, Python, and/or Ansible.
- Experience working in fast-evolving technical environments, with strong adaptability and the ability to learn quickly.
- Excellent analytical, communication, and collaboration skills with a strong ownership mindset.
Benefits
- Competitive compensation aligned with senior-level expertise and Swiss market standards.
- Opportunity to work on cutting-edge AI cluster and high-performance networking technologies.
- Exposure to large-scale systems powering advanced AI training and inference workloads.
- Highly technical, research-driven engineering environment with strong innovation focus.
- Collaborative international team working on next-generation infrastructure challenges.
- Access to complex, large-scale test environments and advanced debugging tools.
- Inclusive workplace culture supporting diversity, equity, and professional growth.
- Relocation support and accommodation of accessibility needs where applicable.