
Principal Deep Learning Communication Architect at Jobgether – United States

Jobgether
United States

About This Position

Principal Deep Learning Communication Architect

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Principal Deep Learning Communication Architect in the United States.

This is a senior technical leadership role focused on defining the future of large-scale AI communication systems powering next-generation distributed deep learning workloads. You will shape the architecture of high-performance communication libraries that enable training and inference at unprecedented scale across massive GPU clusters. Acting at the intersection of software, hardware, and AI systems, you will influence how models with hundreds of billions to trillions of parameters efficiently communicate across advanced interconnects. The role involves deep collaboration with researchers, systems engineers, and hardware architects to co-design scalable solutions for AI infrastructure. You will also contribute to optimizing collective communication frameworks and enabling efficient execution of emerging AI workloads such as agentic and multimodal models. This position is highly strategic and technical, requiring both architectural vision and hands-on systems expertise. Your work will directly impact the performance and scalability of some of the world’s most advanced AI platforms.

Accountabilities:
  • Define the long-term architecture and technical roadmap for large-scale communication libraries supporting next-generation distributed AI systems
  • Lead the design and optimization of communication primitives and collective algorithms for high-performance GPU clusters
  • Drive application and communication co-design efforts across frameworks such as NCCL, NVSHMEM, UCX, UCC, and MPI-based systems
  • Collaborate with hardware architects to influence the design of future interconnects and GPU networking technologies
  • Develop analytical models and simulation tools to evaluate system performance under large-scale AI and HPC workloads
  • Optimize communication performance across heterogeneous interconnects including NVLink, InfiniBand, and Ethernet-based architectures
  • Guide the evolution of distributed training and inference systems to support trillion-parameter and agentic AI models
  • Provide technical leadership across cross-functional teams working on AI infrastructure, runtime systems, and GPU architecture

Requirements:

  • Ph.D. or M.S. in Computer Science, Electrical Engineering, or a related field with 12+ years of experience in HPC or distributed deep learning systems
  • Deep expertise in parallel computing strategies including data, tensor, pipeline, context, and expert parallelism, as well as ZeRO optimizations
  • Strong hands-on experience with communication frameworks such as NCCL, UCX, UCC, NVSHMEM, or MPI
  • Solid understanding of low-level networking protocols and hardware interfaces, including RDMA, RoCE, and InfiniBand
  • Advanced knowledge of high-performance inference systems such as TensorRT-LLM, vLLM, SGLang, or NVIDIA Dynamo
  • Strong background in GPU architecture, including memory hierarchies (HBM3e/HBM4, L2 cache) and CUDA programming
  • Experience working with large-scale distributed training frameworks such as Megatron-Core, DeepSpeed, or JAX/XLA is a plus
  • Proven track record of contributing to or leading open-source projects or publishing research in top-tier systems venues
  • Strong architectural thinking, communication skills, and ability to influence cross-functional technical direction

Benefits:

  • Competitive base salary ranging from $272,000 to $431,250 USD depending on experience and location
  • Eligibility for equity participation in addition to base compensation
  • Comprehensive health, dental, vision, and wellness benefits
  • Remote or hybrid flexibility depending on role requirements
  • Opportunity to work on cutting-edge AI infrastructure at massive global scale
  • Strong focus on research, innovation, and open technical collaboration
  • Inclusive, high-performance engineering culture with world-class technical teams
  • Long-term career growth in advanced AI systems and architecture leadership roles

How Jobgether works:

We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.

We appreciate your interest and wish you the best!

Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.

Job Location

United States
