Staff Software Engineer - Reliability in Palo Alto, California at Luma AI
Employment Type: Full-Time
Luma AI
Palo Alto, California, United States
Job Description
Position: Staff Software Engineer - Reliability
Job Description: Luma AI runs on thousands of H100 GPUs across multiple providers and clusters for training, data processing, and inference. Working with the Infrastructure and Research teams, the Staff Software Engineer - Reliability maintains the health of our GPU clusters, developing the monitoring and management tools necessary to maximize their performance.
Specific job duties include the following:
- Collaborate with researchers and engineers to specify the availability, performance, correctness, and efficiency requirements of the current and future versions of our GPU infrastructure. (15%)
- Work with multiple GPU cloud providers to scale up, scale down, maintain and monitor our GPUs in many clusters. (20%)
- Design and implement solutions to ensure the scalability of our infrastructure to meet rapidly increasing demands. (15%)
- Implement and manage monitoring systems to proactively identify issues and anomalies in our production environment. (10%)
- Implement fault-tolerant and resilient design patterns to minimize service disruptions. (10%)
- Build and maintain automation tools to streamline repetitive tasks and improve system reliability. (15%)
- Participate in an on-call rotation to respond to critical incidents and ensure 24/7 system availability alongside other infrastructure developers. (5%)
- Develop and maintain service level objectives (SLOs) and service level indicators (SLIs) to measure and ensure system reliability. (10%)
Job Requirements: Requires a Master’s degree (or foreign equivalent) in Computer Science, Information Technology, Electronic Engineering, or related field of study, plus 2 years of experience in the job offered, Software Engineer, or a related occupation.
Position requires at least 2 years of experience in the following skills:
- CI/CD pipelines and automation using AWS and Kubernetes.
- CI/CD using GitLab, Docker, and Kubernetes.
- AWS infrastructure using Terraform and Packer.
- Certificate lifecycle management using Vault with PKI.
- Python to automate every aspect of the pipeline.
- Python and Shell Scripts for build automation.
- Jenkins servers for continuous integration.
Job Location
Palo Alto, California, United States