What is the role of a Manager, Reliability Operations at CloudOne Digital?

The Manager, Reliability Operations position at CloudOne Digital is a full-time or part-time opportunity in the Executive/Management field.

Where is this Manager, Reliability Operations job located?

Lansing, Michigan, 48917, United States

What type of employment is offered for this Manager, Reliability Operations role?

Full-time or part-time position

What is the expected salary for this Manager, Reliability Operations job?

Compensation will be discussed during the hiring process.

Manager, Reliability Operations in Lansing, Michigan at CloudOne Digital

About Nexcess

Nexcess provides specialty cloud solutions for organizations where performance and compliance have to coexist. We serve businesses worldwide, from agencies scaling client sites to enterprises running mission-critical operations. We've built our reputation on deep technical expertise and genuine partnership with every client we work with. Behind every environment we manage is a team of people who take the craft seriously and keep showing up when it matters.

About the Role

We’re looking for a Manager of Reliability Operations to lead how we detect, respond to, and learn from failures across our platform ecosystem.

This role sits at the intersection of Operations and Engineering, bringing structure to incident response, accountability to follow-through, and clarity to reliability insights. You’ll ensure that what we learn from production directly improves how our platforms are built, operated, and scaled.

This is a permanent, full-time, remote position.

US Pay Band: $110K - $150K. Actual compensation will vary based on experience, skills, and location.

What You’ll Do

Own Reliability Operations & Incident Command

Continuously evolve and improve incident management, change management, and post-incident practices
Establish clear standards for incident declaration, severity, escalation, and communication
Ensure consistent execution across teams and continuous process improvement

Own the incident command function, including roles, structure, and operating procedures
Lead or oversee major incident response in a 24/7 production environment
Build and manage on-call incident commander rotations with global coverage

Drive Learning, Accountability & Reliability Strategy

Own post-incident reviews, ensuring strong root cause analysis and clear documentation
Translate incident trends into actionable reliability improvements
Drive completion of corrective actions across teams; escalate when needed
Define and maintain service performance and reliability targets (availability, latency, error rates)

Own observability strategy, including monitoring, alerting, and signal quality
Improve detection, reduce time to resolution, and increase platform resilience

Partner with Engineering and Operations on capacity planning, patching, and lifecycle decisions
Ensure reliability insights directly inform platform and infrastructure roadmaps
Collaborate with Security on vulnerability response, patch prioritization, and compliance alignment

Operate Across a Complex Platform Environment

Work across environments including virtualization platforms (VMware), distributed storage (Ceph), Linux-based systems, and hybrid cloud infrastructure
Support platforms that span dedicated hosting, managed applications, and high-availability cloud services
Ensure reliability practices scale across multiple products, brands, and customer environments

Provide regular, data-driven reporting to leadership on availability, incident trends, and operational performance
Act as the central authority on reliability insights across teams

What You Bring

Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent practical experience)
7+ years of experience in systems operations, site reliability, or platform engineering
2+ years of experience leading teams or major operational functions
Proven experience managing incidents in a 24/7 production environment
Strong background in troubleshooting, root cause analysis, and operational improvement
Experience with change management practices

Platform & Tooling Experience

Monitoring and observability platforms (e.g., Datadog, Prometheus, Grafana, New Relic)
Incident management and alerting tools (e.g., PagerDuty, Opsgenie)
Infrastructure and platform technologies (Linux systems, VMware, Ceph, cloud platforms)
Logging and telemetry systems (centralized logging, metrics, tracing)
Ability to translate complex technical data into clear insights
Strong communication skills, especially in high-pressure situations

Nice to Have

Background in Computer Science, Engineering, or a related field
Experience in managed hosting, cloud infrastructure, or SaaS environments
Experience defining and tracking system reliability and performance targets
Familiarity with ITIL or similar operational frameworks
Exposure to VMware, Ceph, Linux, and Windows platforms
Relevant certifications (AWS, RHCE, etc.)

What We Offer

Comprehensive benefits package
Traditional and Roth 401(k) with company matching
A collaborative, team-oriented culture
Consistent and predictable work hours
Engaging, varied work that keeps each day different
Opportunities to contribute ideas and influence how work gets done

Disclaimer:

This job description is only a summary of the typical functions of the position. It is not intended to be an exhaustive or comprehensive list of all job responsibilities, tasks, or duties. Additional duties and tasks may be assigned as part of the job function. Nexcess reserves the right to modify, interpret, or apply this job description in a way that best supports the organizational needs. The job description in no way creates or implies an employment contract. The employment contract remains “at will”.

Equal Employment Opportunity Policy:

Nexcess is committed to offering equal employment opportunity without regard to age, color, disability, gender, gender identity, genetic information, marital status, military status, national origin, race, religion, sexual orientation, veteran status, or any other legally protected characteristic.

