Principal Site Reliability Engineer at Jobgether – United States
About This Position
This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Principal Site Reliability Engineer in the United States.
This role offers the opportunity to lead reliability strategy for a complex, high-scale platform in a mission-driven, data-intensive environment. The Principal Site Reliability Engineer will own cross-cutting initiatives to ensure system reliability, scalability, and security while reducing operational toil. This position requires setting service-level objectives (SLOs), leading incident management, and designing automated workflows and infrastructure patterns. Success involves influencing architecture decisions, mentoring engineering teams, and establishing organization-wide operational standards. The role combines technical leadership with hands-on execution, directly impacting the delivery of secure, high-quality services for critical workflows. This is an ideal opportunity for someone passionate about system reliability, operational excellence, and cloud-based infrastructure.
Responsibilities:
- Serve as a technical leader for reliability across multiple domains, setting standards and maintaining hands-on involvement where necessary
- Define and maintain SLOs, error budgets, and reliability KPIs aligned to customer journeys
- Lead complex incident management, drive post-incident reviews, and implement remediation strategies
- Design and implement automation and self-service workflows to reduce operational toil and risk
- Scale infrastructure and platform operations using GitOps (Argo CD), Crossplane, Terraform, and cloud services (AWS)
- Conduct operational readiness and reliability reviews for new features and architectural changes
- Mentor Staff and Senior engineers, fostering best practices in reliability, performance, and security
Requirements:
- 8+ years of experience in SRE, platform engineering, systems engineering, or similar roles supporting production services at scale
- Demonstrated principal-level impact through cross-team initiatives and architecture influence
- Expertise in Kubernetes operations, troubleshooting, and safe deployment strategies
- Strong experience with GitOps workflows (Argo CD) and automation using Argo Workflows
- Infrastructure provisioning and orchestration skills with Crossplane and Terraform
- Deep AWS knowledge (IAM, networking, compute, storage, observability) and understanding of cloud failure modes
- Proficiency in Python for building automation and reliability improvements
- Strong incident management experience with measurable improvements in availability, MTTR, or operational maturity
- Excellent communication skills, including the ability to translate technical trade-offs for diverse stakeholders
Benefits:
- Flexible, remote-friendly work environment
- Opportunities for personal and professional development
- Collaborative and mission-driven culture
- Participation in a talented, diverse, and energized engineering community
- Programs supporting employee growth, well-being, and engagement