Principal Deep Learning Communication Architect at Jobgether – United States
About This Position
This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Principal Deep Learning Communication Architect in the United States.
This is a senior technical leadership role focused on defining the future of large-scale AI communication systems powering next-generation distributed deep learning workloads. You will shape the architecture of high-performance communication libraries that enable training and inference at unprecedented scale across massive GPU clusters. Acting at the intersection of software, hardware, and AI systems, you will influence how models with hundreds of billions to trillions of parameters efficiently communicate across advanced interconnects. The role involves deep collaboration with researchers, systems engineers, and hardware architects to co-design scalable solutions for AI infrastructure. You will also contribute to optimizing collective communication frameworks and enabling efficient execution of emerging AI workloads such as agentic and multimodal models. This position is highly strategic and technical, requiring both architectural vision and hands-on systems expertise. Your work will directly impact the performance and scalability of some of the world’s most advanced AI platforms.
Responsibilities:
- Define the long-term architecture and technical roadmap for large-scale communication libraries supporting next-generation distributed AI systems
- Lead the design and optimization of communication primitives and collective algorithms for high-performance GPU clusters
- Drive application and communication co-design efforts across frameworks such as NCCL, NVSHMEM, UCX, UCC, and MPI-based systems
- Collaborate with hardware architects to influence the design of future interconnects and GPU networking technologies
- Develop analytical models and simulation tools to evaluate system performance under large-scale AI and HPC workloads
- Optimize communication performance across heterogeneous interconnects including NVLink, InfiniBand, and Ethernet-based architectures
- Guide the evolution of distributed training and inference systems to support trillion-parameter and agentic AI models
- Provide technical leadership across cross-functional teams working on AI infrastructure, runtime systems, and GPU architecture
Requirements:
- Ph.D. or M.S. in Computer Science, Electrical Engineering, or a related field with 12+ years of experience in HPC or distributed deep learning systems
- Deep expertise in parallel computing strategies including data, tensor, pipeline, context, and expert parallelism, as well as ZeRO optimizations
- Strong hands-on experience with communication frameworks such as NCCL, UCX, UCC, NVSHMEM, or MPI
- Solid understanding of RDMA, RoCE, and InfiniBand low-level networking protocols and hardware interfaces
- Advanced knowledge of high-performance inference systems such as TensorRT-LLM, vLLM, SGLang, or NVIDIA Dynamo
- Strong background in GPU architecture, including memory hierarchies (HBM3e/HBM4, L2 cache) and CUDA programming
- Experience working with large-scale distributed training frameworks such as Megatron-Core, DeepSpeed, or JAX/XLA is a plus
- Proven track record of contributing to or leading open-source projects or publishing research in top-tier systems venues
- Strong architectural thinking, communication skills, and ability to influence cross-functional technical direction
Benefits:
- Competitive base salary ranging from $272,000 to $431,250 USD depending on experience and location
- Eligibility for equity participation in addition to base compensation
- Comprehensive health, dental, vision, and wellness benefits
- Remote or hybrid flexibility depending on role requirements
- Opportunity to work on cutting-edge AI infrastructure at massive global scale
- Strong focus on research, innovation, and open technical collaboration
- Inclusive, high-performance engineering culture with world-class technical teams
- Long-term career growth in advanced AI systems and architecture leadership roles