Senior HPC Cluster Engineer in Ireland, Scotland at Jobgether
Job Description
This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior HPC Cluster Engineer in Ireland.
Join a high-impact engineering environment focused on advancing next-generation AI cloud infrastructure at hyperscale. This role offers the opportunity to work on cutting-edge GPU computing, high-performance networking, and distributed infrastructure systems that power large-scale AI and machine learning workloads. As a Senior HPC Cluster Engineer, you will play a key role in optimizing GPU clusters, enhancing InfiniBand network performance, and ensuring reliability across complex HPC environments. Working alongside highly skilled engineers, you’ll contribute to infrastructure improvements spanning virtualization, automation, performance tuning, and hardware integration. The position combines deep technical ownership with meaningful collaboration in a fast-moving, innovation-driven environment built for ambitious engineers. If you thrive on solving low-level systems challenges and shaping the future of AI infrastructure, this opportunity offers exceptional technical depth and impact.
Responsibilities:
- Optimize the performance, scalability, and reliability of GPU clusters and InfiniBand networks within high-performance computing environments.
- Analyze, troubleshoot, and resolve root-cause issues related to GPU infrastructure, networking performance, and distributed computing systems.
- Integrate and support new hardware technologies, including GPU devices, within existing cloud and virtualization environments.
- Configure, maintain, and improve GPU device orchestration, Kubernetes integrations, and virtualization stacks such as QEMU and KVM.
- Design and enhance automation systems for proactive monitoring, fault detection, and issue remediation across HPC clusters.
- Collaborate with engineering teams to improve infrastructure efficiency, system performance, and platform resilience.
- Conduct performance analysis and optimization for HPC workloads, including AI/ML training, simulations, and large-scale data processing.
- Contribute to infrastructure evolution by improving networking, virtualization, and distributed computing capabilities.
Requirements:
- 5+ years of professional experience in system-level software engineering, low-level programming, or infrastructure performance optimization.
- 3+ years of hands-on Linux systems experience, including administration, troubleshooting, and performance tuning.
- Strong understanding of server architecture, Linux kernel fundamentals, PCIe devices, NICs, and high-performance computing systems.
- Strong programming expertise in languages such as C, C++, Go, Python, or similar performance-oriented technologies.
- Experience working with GPU infrastructure, HPC environments, distributed systems, or performance-critical systems.
- Familiarity with containerization, orchestration, or virtualization technologies including Kubernetes, QEMU, or KVM.
- Strong troubleshooting, systems analysis, and problem-solving capabilities in complex technical environments.
- Excellent collaboration and communication skills with the ability to work effectively in cross-functional engineering teams.
- Bonus points for experience with InfiniBand, RDMA, RoCE, MPI, NCCL, SDN, or GPU cluster testing environments.
- Additional experience with deep learning frameworks such as PyTorch or TensorFlow and AI/ML infrastructure is highly valued.
Benefits:
- Competitive compensation package aligned with experience and technical expertise.
- Flexible remote work across Europe, supporting work-life balance.
- Career development opportunities with continuous learning and professional growth support.
- Opportunity to work on impactful AI and high-performance computing projects at global scale.
- Collaborative, innovative, and engineering-focused culture built around ownership and autonomy.
- Exposure to world-class international engineering teams across cloud, infrastructure, and AI domains.
- Fast-moving environment with meaningful technical challenges and long-term career progression.
- Inclusive workplace committed to diversity, equal opportunity, and employee growth.