Architect - Platform Engineer in United States at Jobgether
Job Description
This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for an Architect - Platform Engineer based in the United States.
This is a senior-level architecture role focused on designing and scaling next-generation infrastructure for GenAI and large language model (LLM) workloads in enterprise and production environments. You will define the platform foundations that power distributed training, GPU-accelerated computing, and AI model deployment at scale. The role blends deep systems engineering expertise with modern cloud-native architecture, requiring strong fluency across Kubernetes, high-performance computing, and AI infrastructure stacks. You will collaborate with data scientists, ML engineers, and software architects to deliver robust, scalable GenAI platforms. The environment is highly innovative, fast-paced, and centered on cutting-edge AI transformation across industries. This role is ideal for a hands-on architect who thrives at the intersection of infrastructure, performance engineering, and applied AI systems.
Responsibilities
- Design, build, and optimize scalable infrastructure for GenAI and LLM workloads across multi-GPU and distributed computing environments.
- Architect and manage high-performance compute platforms using Slurm clusters and container orchestration systems such as Kubernetes and OpenShift.
- Lead GPU performance profiling, benchmarking, and optimization for distributed training and inference workloads.
- Enable and maintain NVIDIA GPU ecosystem components including CUDA, cuDNN, NCCL, Triton, and related tooling.
- Develop and operationalize GenAI pipelines supporting fine-tuning, RAG architectures, multi-modal systems, and LLMOps workflows.
- Build reusable infrastructure-as-code templates using tools such as Terraform and Helm to support scalable deployments.
- Collaborate with cross-functional engineering teams to deploy AI solutions into both research and production environments.
- Drive automation, CI/CD practices, and platform reliability through modern DevOps and cloud engineering principles.
- Lead technical architecture discussions with internal and client-facing stakeholders, providing scalable and production-ready solutions.
Requirements
- 10+ years of experience in platform engineering, infrastructure architecture, or high-performance computing environments.
- Strong hands-on expertise with Kubernetes and/or Red Hat OpenShift in production-scale deployments.
- Deep knowledge of GPU computing ecosystems including CUDA, cuDNN, NCCL, Nsight, and TensorRT/Triton.
- Proven experience with Slurm-based distributed training systems and multi-GPU optimization.
- Strong Linux systems expertise with performance tuning and infrastructure scaling experience.
- Experience building and deploying GenAI workloads such as LLM fine-tuning, RAG pipelines, or multimodal AI systems.
- Solid understanding of infrastructure-as-code tools including Terraform and Ansible.
- Experience working with cloud GPU environments (AWS, Azure, GCP, OCI) or on-prem GPU clusters.
- Strong communication and leadership skills with experience mentoring teams and driving architecture decisions.
- Ability to work in client-facing environments and translate technical complexity into scalable solutions.
Benefits
- Competitive compensation aligned with senior-level platform engineering roles
- Remote-first flexibility across the United States and Canada
- Opportunity to work on cutting-edge GenAI and LLM infrastructure at enterprise scale
- Exposure to leading cloud and AI ecosystems including major hyperscalers and GPU platforms
- Career growth within a fast-scaling AI-first engineering organization
- Hands-on work with advanced technologies such as distributed training, GPU clusters, and LLM systems
- Collaborative, innovation-driven environment with strong emphasis on learning and technical excellence
- Opportunity to work on high-impact AI transformation projects across multiple industries