Infrastructure Operations Engineer in United States at Jobgether

Job Function: Admin/Clerical/Secretarial
Jobgether
United States

Job Description

Infrastructure Operations Engineer

This position is posted by Jobgether on behalf of a partner company. We are currently looking for an Infrastructure Operations Engineer in the United States.

In this role, you will help operate and scale the AI and GPU infrastructure that powers next-generation machine learning workloads across research, startup, and enterprise environments. You will work at the intersection of reliability engineering, cloud operations, and automation, ensuring that complex distributed systems remain performant, observable, and resilient. The position offers hands-on exposure to bare metal infrastructure, Kubernetes environments, and cloud platforms, with a strong emphasis on operational excellence and automation. You will collaborate closely with infrastructure engineers, network specialists, and software teams to resolve incidents, improve system reliability, and reduce operational friction. Working in a fast-moving environment, you will contribute directly to platform stability and customer success. This is a highly technical, high-impact role for engineers who thrive in complex infrastructure ecosystems and enjoy building scalable operational systems.

Accountabilities:

In this role, you will be responsible for ensuring the reliability, scalability, and efficiency of large-scale infrastructure systems supporting GPU and cloud-based workloads.

  • Operate, monitor, and maintain large-scale Linux-based and GPU-enabled infrastructure environments
  • Support provisioning, deployment, and lifecycle management of compute and storage systems
  • Build automation and tooling to reduce operational overhead and improve system reliability
  • Manage and optimize cloud infrastructure components across AWS and hybrid environments
  • Work with Kubernetes clusters and containerized workloads to ensure system stability and performance
  • Support incident response, troubleshooting, and root cause analysis in production environments
  • Implement and improve observability solutions using monitoring and logging tools such as Prometheus and ELK
  • Collaborate with engineering and network teams to improve infrastructure design and operational workflows
  • Participate in on-call rotations and ensure timely resolution of production issues
  • Contribute to infrastructure improvements, including GitOps workflows and configuration management

Requirements:

This role requires strong infrastructure engineering experience with deep expertise in systems operations, cloud platforms, and automation.

  • 8+ years of experience working with Linux systems in production environments
  • 5+ years of experience with AWS infrastructure and cloud services
  • 2+ years of experience with Kubernetes and containerized workloads
  • Hands-on experience with Terraform and Ansible for infrastructure as code
  • Experience managing network-attached storage systems (e.g., NFS, Ceph, or similar)
  • Strong understanding of monitoring and observability tools such as Prometheus and ELK stack
  • Familiarity with GitOps workflows and modern infrastructure automation practices
  • Programming or scripting experience in Python, Go, Bash, or similar languages for automation
  • Strong networking fundamentals, including understanding of distributed systems and datacenter environments
  • Experience working with bare metal systems, GPU infrastructure, or large-scale compute environments is highly valued
  • Strong problem-solving skills and ability to operate effectively in ambiguous, fast-changing environments
  • Excellent communication skills and ability to collaborate across technical teams

Benefits:

  • Competitive salary ($160,000–$200,000 USD base range) plus equity and potential bonus
  • Fully flexible work environment (remote or hybrid within the United States)
  • Comprehensive medical, dental, and vision coverage (U.S. employees)
  • Retirement and financial wellness programs
  • Generous paid time off and company holidays
  • Paid parental leave
  • Professional development and learning support
  • Wellness, home-office, and work-from-home stipends
  • Opportunity to work on cutting-edge AI and GPU infrastructure at scale

How Jobgether works:

We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.

We appreciate your interest and wish you the best!

Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.
