Software Engineer, Compute Infrastructure in United States at Jobgether
Job Description
This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Software Engineer, Compute Infrastructure in the United States.
This role sits at the core of building and scaling the infrastructure that powers large-scale AI systems, transforming massive compute resources into a reliable, efficient, and high-performance platform. You will work across the full infrastructure stack, from hardware and networking to orchestration, storage, and developer tooling, enabling researchers and product teams to run complex workloads with speed and reliability. The environment is highly technical and deeply collaborative, where small improvements in systems performance, scheduling, or observability can have significant downstream impact. You will contribute to designing and operating distributed systems that span accelerators, CPUs, networks, and data centers. This role offers exposure to cutting-edge compute environments and the opportunity to directly influence the efficiency and scalability of frontier AI workloads. It is ideal for engineers who enjoy working across systems layers and solving deeply complex infrastructure challenges.
Responsibilities
- Design, build, and optimize large-scale compute infrastructure systems supporting high-performance AI workloads across distributed environments.
- Develop and operate infrastructure spanning compute, networking, storage, orchestration, and cluster scheduling systems.
- Improve performance and reliability through profiling, benchmarking, and optimization of workloads across compute, memory, and network layers.
- Build automation and tooling for provisioning, monitoring, incident response, and lifecycle management of compute resources.
- Contribute to the design of developer platforms, observability tools, CaaS systems, and agent infrastructure to improve usability and efficiency.
- Collaborate with research, hardware, networking, and operations teams to ensure efficient and scalable compute capacity.
- Identify system bottlenecks and translate operational insights into durable infrastructure improvements and abstractions.
- Help evolve the platform architecture to better support heterogeneous, large-scale compute environments.
Requirements
- Strong software engineering background with experience in production-grade infrastructure systems.
- Experience in one or more areas such as distributed systems, high-performance computing, networking, storage systems, Kubernetes, observability, or infrastructure tooling.
- Solid understanding of system-level performance optimization, debugging, and large-scale system behavior.
- Familiarity with GPU infrastructure, RDMA, NCCL, or other high-performance communication frameworks is a plus.
- Ability to work across hardware, software, and networking layers to diagnose and resolve complex issues.
- Strong ownership mindset with the ability to operate effectively in ambiguous and fast-changing environments.
- Excellent collaboration and communication skills across multidisciplinary engineering teams.
- Motivation to build scalable infrastructure that enables advanced AI research and production systems.
Benefits
- Competitive compensation aligned with experience and market standards.
- Comprehensive health, dental, and vision insurance coverage.
- Flexible work arrangements supporting collaboration across distributed teams.
- Opportunity to work on cutting-edge AI infrastructure at massive scale.
- High-impact role with direct contribution to frontier AI research and systems.
- Professional growth in a highly technical and research-driven engineering environment.
- Inclusive and mission-driven workplace culture focused on safety, collaboration, and innovation.