What is the role of a Staff Software Engineer - Reliability at Luma AI?

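The Staff Software Engineer - Reliability position at Luma AI is a full-time role focused on maintaining the health of the company's GPU clusters and developing the monitoring and management tools needed to maximize their performance.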

Where is this Staff Software Engineer - Reliability job located?

Palo Alto, California, United States

What industry does this Staff Software Engineer - Reliability position belong to?

This role spans multiple industries.

What is the expected salary for this Staff Software Engineer - Reliability job?

Compensation will be discussed during the hiring process.

Position: Staff Software Engineer - Reliability
Job Description: Luma AI runs on thousands of H100 GPUs across multiple providers and clusters for training, data processing, and inference. Working with the Infrastructure and Research teams, the Staff Software Engineer - Reliability maintains the health of our GPU clusters, developing the monitoring and management tools necessary to maximize their performance.
Specific job duties include the following:

Collaborate with researchers and engineers to specify the availability, performance, correctness, and efficiency requirements of the current and future versions of our GPU infrastructure. (15%)
Work with multiple GPU cloud providers to scale up, scale down, maintain and monitor our GPUs in many clusters. (20%)
Design and implement solutions to ensure the scalability of our infrastructure to meet rapidly increasing demands. (15%)
Implement and manage monitoring systems to proactively identify issues and anomalies in our production environment. (10%)
Implement fault-tolerant and resilient design patterns to minimize service disruptions. (10%)
Build and maintain automation tools to streamline repetitive tasks and improve system reliability. (15%)
Participate in an on-call rotation to respond to critical incidents and ensure 24/7 system availability alongside other infrastructure developers. (5%)
Develop and maintain service level objectives (SLOs) and service level indicators (SLIs) to measure and ensure system reliability. (10%)

Job Requirements: Requires a Master’s degree (or foreign equivalent) in Computer Science, Information Technology, Electronic Engineering, or related field of study, plus 2 years of experience in the job offered, Software Engineer, or a related occupation.
Position requires at least 2 years of experience in the following skills:

CI/CD pipelines and automation using AWS and Kubernetes.
CI/CD using GitLab, Docker, and Kubernetes.
AWS infrastructure using Terraform and Packer.
Certificate lifecycle management using Vault with PKI.
Python to automate every aspect of the pipeline.
Python and Shell Scripts for build automation.
Jenkins servers for continuous integration.
