Datacenter Hardware Operations Technician Lead, Industrial Compute in United States at Jobgether
Job Description
This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Datacenter Hardware Operations Technician Lead, Industrial Compute, based in the United States.
This role sits at the core of large-scale AI infrastructure reliability, where hands-on datacenter expertise directly supports the performance of advanced compute environments powering frontier AI systems. You will act as the senior on-site technical authority for hardware operations, ensuring the stability, availability, and lifecycle health of GPU, server, and storage systems. The position combines deep technical troubleshooting with operational leadership across high-density industrial compute campuses. You will partner closely with engineering, operations, and external vendors to resolve complex hardware issues and drive long-term reliability improvements. The environment is fast-scaling, mission-critical, and deeply collaborative, requiring both precision execution and systems-level thinking. This is a highly impactful role shaping the operational backbone of next-generation AI infrastructure.
In this role, you will lead on-site hardware operations and ensure the reliability and performance of large-scale compute infrastructure supporting mission-critical workloads.
- Serve as the senior on-site technical lead for server, GPU, storage, and rack-level hardware operations
- Drive diagnosis, triage, and resolution of complex hardware failures impacting production systems
- Lead root cause analysis (RCA) efforts and implement corrective and preventive actions to improve fleet reliability
- Partner with engineering, OEM vendors, and operations teams to manage repairs, replacements, and lifecycle activities
- Develop, refine, and standardize hardware maintenance procedures, troubleshooting runbooks, and operational best practices
- Analyze hardware failure trends and operational telemetry to identify risks and reliability improvement opportunities
- Support hardware onboarding, validation, and production readiness for new infrastructure deployments
- Mentor technicians and partner teams on advanced troubleshooting and hardware reliability practices
This role requires extensive experience in large-scale datacenter environments, with strong technical depth in hardware systems and proven leadership in operational troubleshooting.
- 8+ years of experience in datacenter hardware operations, sustaining engineering, or senior technician roles
- Strong expertise in server, GPU, storage, and rack-level infrastructure in large-scale environments
- Proven ability to diagnose complex hardware failures and lead high-priority production incident resolution
- Experience conducting root cause analysis and driving long-term reliability improvements
- Solid understanding of hardware reliability engineering, fleet health, and operational monitoring systems
- Ability to collaborate across engineering, operations, and vendor ecosystems in high-pressure environments
- Strong communication skills with experience documenting processes and influencing technical decisions
- Familiarity with Linux systems, hardware validation workflows, and datacenter tooling is a plus
This role offers the following compensation and benefits.
- Competitive base compensation with equity and performance-based bonus eligibility
- Comprehensive medical, dental, and vision coverage with employer contributions
- 401(k) retirement plan with employer match
- Generous paid time off, holidays, and company-wide recharge breaks
- Paid parental leave, medical leave, and caregiver support programs
- Annual learning and development stipend for professional growth
- Wellness and mental health support resources
- Relocation support for eligible employees and additional lifestyle benefits