Infrastructure Operations Engineer in United States at Jobgether
Job Description
This position is posted by Jobgether on behalf of a partner company. We are currently looking for an Infrastructure Operations Engineer in the United States.
In this role, you will help operate and grow large-scale AI and GPU infrastructure that powers next-generation machine learning workloads across research, startup, and enterprise environments. You will work at the intersection of reliability engineering, cloud operations, and automation, ensuring that complex distributed systems remain performant, observable, and resilient. The position offers hands-on exposure to bare metal infrastructure, Kubernetes environments, and cloud platforms, with a strong emphasis on operational excellence and automation. You will collaborate closely with infrastructure engineers, network specialists, and software teams to resolve incidents, improve system reliability, and reduce operational friction, contributing directly to platform stability and customer success in a fast-moving environment. This is a highly technical, high-impact role for engineers who thrive in complex infrastructure ecosystems and enjoy building scalable operational systems.
You will be responsible for the reliability, scalability, and efficiency of large-scale infrastructure systems supporting GPU and cloud-based workloads:
- Operate, monitor, and maintain large-scale Linux-based and GPU-enabled infrastructure environments
- Support provisioning, deployment, and lifecycle management of compute and storage systems
- Build automation and tooling to reduce operational overhead and improve system reliability
- Manage and optimize cloud infrastructure components across AWS and hybrid environments
- Work with Kubernetes clusters and containerized workloads to ensure system stability and performance
- Support incident response, troubleshooting, and root cause analysis in production environments
- Implement and improve observability solutions using monitoring and logging tools such as Prometheus and the ELK stack
- Collaborate with engineering and network teams to improve infrastructure design and operational workflows
- Participate in on-call rotations and ensure timely resolution of production issues
- Contribute to infrastructure improvements, including GitOps workflows and configuration management
This role requires strong infrastructure engineering experience with deep expertise in systems operations, cloud platforms, and automation:
- 8+ years of experience working with Linux systems in production environments
- 5+ years of experience with AWS infrastructure and cloud services
- 2+ years of experience with Kubernetes and containerized workloads
- Hands-on experience with Terraform and Ansible for infrastructure as code
- Experience managing network-attached storage systems (e.g., NFS, Ceph, or similar)
- Strong understanding of monitoring and observability tools such as Prometheus and the ELK stack
- Familiarity with GitOps workflows and modern infrastructure automation practices
- Programming or scripting experience in Python, Go, Bash, or similar languages for automation
- Strong networking fundamentals, including an understanding of distributed systems and datacenter environments
- Experience working with bare metal systems, GPU infrastructure, or large-scale compute environments is highly valued
- Strong problem-solving skills and ability to operate effectively in ambiguous, fast-changing environments
- Excellent communication skills and ability to collaborate across technical teams
This role offers the following compensation and benefits:
- Competitive salary ($160,000–$200,000 USD base range) plus equity and potential bonus
- Fully flexible work environment (remote or hybrid within the United States)
- Comprehensive medical, dental, and vision coverage (U.S. employees)
- Retirement and financial wellness programs
- Generous paid time off and company holidays
- Paid parental leave
- Professional development and learning support
- Wellness, home-office, and work-from-home stipends
- Opportunity to work on cutting-edge AI and GPU infrastructure at scale