Infrastructure Engineer (GPU & Compute) in United States at Jobgether
Job Description
This position is posted by Jobgether on behalf of a partner company. We are currently looking for an Infrastructure Engineer (GPU & Compute) in the United States.
This role is at the core of building and scaling high-performance infrastructure designed for modern AI and machine learning workloads. You will work across hardware, systems, and software layers to ensure GPU-enabled environments are reliable, efficient, and production-ready from day one. The position combines deep technical expertise with hands-on ownership of image pipelines, system validation, and large-scale compute environments. You will play a critical role in enabling seamless deployment and operation of cutting-edge AI infrastructure by improving automation, diagnostics, and performance. Collaborating with cross-functional teams, you will help bring new systems online, validate next-generation hardware, and enhance operational efficiency. This is a high-impact opportunity within a fast-paced, innovation-driven environment focused on scaling compute for the future of AI.
Responsibilities
- Own and evolve systems for image management, deployment, and validation across large-scale bare-metal and GPU-enabled infrastructure environments.
- Maintain and operate validation clusters used for system diagnostics, testing, and infrastructure bring-up to ensure readiness and reliability.
- Lead GPU diagnostics and validation workflows, identifying performance bottlenecks, failure patterns, and system-level issues across hardware and software layers.
- Build and enhance automation tools and workflows (primarily in Python) to streamline provisioning, validation, and operational processes.
- Support hardware qualification efforts for new platforms, including firmware, drivers, and operating system validation.
- Manage Linux-based production and validation environments, including virtualization and bare-metal provisioning systems (e.g., PXE workflows).
- Collaborate with infrastructure, hardware, data center, and ML teams to align systems with workload requirements and ensure optimal performance.
- Contribute to best practices for infrastructure lifecycle management, system diagnostics, and scalability improvements.
Requirements
- 5+ years of experience in infrastructure engineering, systems engineering, or related technical roles.
- Strong expertise in Linux systems administration within production or large-scale environments.
- Hands-on experience with GPU-enabled systems and performance/monitoring tools such as NVIDIA DCGM.
- Solid understanding of bare-metal provisioning, system bring-up processes, and image-based deployment workflows.
- Proficiency in Python or similar programming/scripting languages for building automation tools.
- Demonstrated ability to troubleshoot complex issues across hardware, operating systems, GPUs, and system software layers.
- Familiarity with hardware management interfaces such as IPMI, iDRAC, or Redfish.
- Experience working with data center infrastructure and physical hardware environments is highly valued.
- Bonus: Experience with high-performance interconnects (InfiniBand, NVLink), AI/ML or HPC workloads, and large-scale hardware validation frameworks.
Benefits
- Competitive base salary ranging from $180,000 to $200,000 USD, based on experience and location.
- Performance-based bonus and meaningful equity participation.
- Comprehensive medical, dental, and vision coverage.
- Retirement and financial wellness programs.
- Generous paid time off, holidays, and paid parental leave.
- Flexible remote or hybrid work options within the United States.
- Professional development support and learning opportunities.
- Wellness and home office stipends.
- Inclusive and collaborative work environment focused on innovation and balance.