
Senior Site Reliability Engineer, AI Factory at Jobgether – United States

Jobgether
United States
Job Function: Engineering

About This Position

Senior Site Reliability Engineer, AI Factory

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior Site Reliability Engineer, AI Factory in the United States.
This role focuses on designing, operating, and optimizing next-generation GPU-accelerated data centers at scale, ensuring performance, reliability, and efficiency for AI workloads. You will lead the end-to-end lifecycle of critical infrastructure, from provisioning and commissioning to day-to-day operations, while collaborating across hardware, software, and operational teams. Success in this position requires deep technical expertise, hands-on problem solving, and a passion for open-source solutions and automation. You will help define operational standards for large-scale AI facilities, drive continuous improvement, and implement processes that maintain uptime while enabling cutting-edge innovation. This role offers the opportunity to impact global AI infrastructure and work in a high-performance, collaborative environment with engineers tackling unique telemetry, orchestration, and reliability challenges.
Accountabilities:
  • Architect, commission, and provision GPU systems at large scale, ensuring supported firmware and component versions are maintained across operations.
  • Lead Day-2 operations, monitoring cluster hardware, identifying bottlenecks, and optimizing efficiency, performance, and availability.
  • Triage hardware break-fix issues, develop automated solutions, and continuously improve operational workflows.
  • Collaborate with hardware, software, and technical teams to define repeatable procedures and operational strategies aligned with SLAs.
  • Develop and enforce quality control procedures to minimize downtime and maintain high reliability for mission-critical AI infrastructure.
  • Provide documentation and operational guidance to support global AI data center deployments and internal teams.
  • Feed hardware and software requirements into engineering pipelines and coordinate with remote hands and field teams.

Requirements:
  • Bachelor's or Master's degree in Computer Engineering, Computer Science, or a related field, or equivalent experience.
  • 10+ years of experience in data center operations, site reliability, or critical infrastructure management.
  • Proven experience managing GPU fleets and large-scale computing environments.
  • Expertise in BMS, power management, and commissioning/provisioning processes.
  • Hands-on experience with configuration management, Packer, QCOW2 images, and data center inventory management systems (NetBox, Nautilus, or similar).
  • Strong track record of cross-team collaboration to deliver operational excellence and reliability improvements.
  • Knowledge of automated break-fix solutions, message bus systems, workflow engines, and Zero Touch Provisioning is highly desirable.
  • Excellent problem-solving skills, attention to detail, and the ability to implement robust processes for uptime and performance optimization.

Benefits:
  • Competitive base salary: $176,000 to $276,000 (Level 4) or $208,000 to $333,500 (Level 5), based on experience and location.
  • Equity participation and bonus eligibility.
  • Comprehensive medical, dental, and vision coverage.
  • Paid leave, holidays, and flexible work arrangements.
  • Professional development opportunities and access to learning platforms.
  • Retirement plans and financial wellness programs.
  • Collaborative environment with exposure to cutting-edge AI and open-source data center technologies.
Why Apply Through Jobgether?
We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.
We appreciate your interest and wish you the best!

Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.



