Sr Kubernetes Support Engineer at Applied Digital Corporation – Dallas, Texas
About This Position
About Applied Digital:
At Applied Digital, we are the epicenter of AI innovation, crafting cutting-edge data center solutions tailored for the demands of high-performance computing. Designed from the ground up to support AI and machine learning workloads, our infrastructure is the backbone of tomorrow’s technological advancements, including AI-driven video and generative platforms.
We are:
- Forward-Thinkers: With a keen eye on current market trends and future innovations, we adapt swiftly and lead technological evolution.
- Resilient: We navigate complex challenges and emerge stronger, delivering robust and reliable solutions for industry pioneers.
- Innovative Designers: Leveraging the latest technologies, we create visionary solutions that redefine industry standards.
At Applied Digital, we are committed to solving intricate problems, advancing business initiatives, maximizing operational efficiency, and reducing our carbon footprint. We are a team of resilient, forward-thinking innovators driving the AI revolution.
Position Summary:
Applied Digital is seeking an experienced Sr. Kubernetes Support Engineer to help manage our deployed Kubernetes (K8s) systems, both internal and external. This role will help us support, design, and maintain the complex systems that live on our cloud platforms. It sits at the center of our product, helping develop our entire resource provisioning lifecycle, from a single API request to the scheduling and spin-up of multiple resources.
You will be the primary point of contact for our customers using Kubernetes and will also take an architect role for our core provisioning logic, creating a robust system that intelligently orchestrates Kubernetes clusters, MicroVMs, and Slurm-managed HPC resources. You will work closely with our front-end team to build the resources that expose this power to our users, and with the infrastructure team to ensure the backend is scalable, resilient, and efficient.
The ideal candidate is a strong systems-level thinker who is passionate about automation, distributed systems, and building powerful HPC clusters that are easily adaptable to a customer’s design requirements.
Key Responsibilities:
- Design & Develop Provisioning Services: Architect and write high-quality, scalable backend services (e.g., in Go, Python, or Rust) that handle the logic for provisioning and managing compute and storage resources.
- Kubernetes Design and Integration: Develop controllers and operators to automate the deployment and lifecycle management of containerized workloads and services on multiple Kubernetes clusters.
- Slurm Orchestration: Build the "bridge" between our cloud-native API and our HPC backend, writing the logic to dynamically generate Slurm batch scripts, submit jobs, and monitor their state.
- MicroVM Management: Implement provisioning workflows for lightweight MicroVMs (using technologies like Firecracker, KubeVirt, or Kata Containers) to ensure fast boot times and secure workload isolation.
- Storage Provisioning: Write the automation to dynamically provision, attach, and manage various storage solutions (e.g., block storage, shared file systems) for provisioned workloads.
- Observability & Monitoring: Implement comprehensive monitoring, logging, and tracing (using tools like Prometheus, Grafana, Loki) to ensure the health and performance of all systems.
- Infrastructure as Code (IaC): Use tools like Terraform, Ansible, and Git to track and manage code versions for the Kubernetes clusters and related infrastructure.
Basic Qualifications:
- 10+ years of professional Kubernetes development experience, with a strong focus on building scalable distributed systems. Deep, hands-on experience with Kubernetes in a production environment (cluster management, writing operators, controllers, and custom resource definitions (CRDs)).
- Proficiency in a modern programming or scripting language (e.g., Go, Python, Bash) and familiarity with data formats such as JSON.
- Solid understanding of container technologies (Docker, containerd) and the container ecosystem.
- Experience with Infrastructure as Code (IaC) tools like Terraform or Ansible.
- Experience collaborating with front-end teams and defining API contracts.
Preferred Qualifications:
- Direct experience with Slurm or other HPC schedulers (e.g., LSF, PBS).
- Experience with MicroVM or sandboxed container technologies (e.g., Firecracker, Kata Containers, gVisor, KubeVirt).
- Knowledge of scalable storage solutions (e.g., Weka, Ceph, MinIO, or cloud-provider storage like S3, EBS).
- Experience building CI/CD pipelines (e.g., Jenkins, GitLab CI, ArgoCD).
- Familiarity with monitoring and observability stacks (Prometheus, Grafana, ELK/Loki).
- Contributions to open-source projects.
Please note that Applied Digital is currently unable to sponsor new applicants for employment authorization or provide immigration-related support for this position. This includes, but is not limited to, visa categories such as H-1B, F-1 OPT, F-1 STEM OPT, F-1 CPT, J-1, TN, E-2, E-3, L-1, O-1, and any Employment Authorization Documents (EADs) or other work authorizations that require employer sponsorship.
Physical Requirements:
- Able to remain in a seated position for an extended period and to lift and carry up to 15 lbs. (office manuals, case notebooks, case files, case materials, standard boxes, report binders, etc.) as needed.
The company has reviewed this job description to ensure that essential functions and basic duties have been included. It is intended to provide guidelines for job expectations and the employee's ability to perform the position described. It is not intended to be construed as an exhaustive list of all functions, responsibilities, skills and abilities. Additional functions and requirements may be assigned by supervisors as deemed appropriate. This document does not represent a contract of employment, and the company reserves the right to change this job description and/or assign tasks for the employee to perform, as the company may deem appropriate.
This job description in no way states or implies that these are the only duties to be performed by the employee(s) incumbent in this position. Employees will be required to follow any other job-related instructions and to perform any other job-related duties requested by any person authorized to give instructions or assignments. All duties and responsibilities are essential functions and requirements and are subject to possible modification to reasonably accommodate individuals with disabilities. To perform this job successfully, the incumbents acknowledge that they possess the skills, aptitudes, and abilities to perform each duty proficiently. Some requirements may exclude individuals who pose a direct threat or significant risk to the health or safety of themselves or others. This document does not create an employment contract, implied or otherwise, other than an “at will” relationship.
The company is an Equal Opportunity Employer, drug free workplace, and complies with ADA regulations as applicable.
Job Location
This job is located in the Dallas, Texas, 75219, United States region.