Product Reliability Engineer in Canada Creek, Nova Scotia at Jobgether
Job Description
This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Product Reliability Engineer based in Canada.
This role sits at the critical intersection of software engineering, customer reliability, and production operations for infrastructure software deployed in complex, real-world environments. You will ensure that production systems running in customer-owned Kubernetes environments remain stable, observable, and continuously improvable. The work goes beyond incident response, focusing on eliminating entire categories of failures through better tooling, automation, and product design. You will partner closely with customers, engineers, and solution teams to investigate complex issues, drive root-cause analysis, and translate findings into long-term system improvements. This is a highly hands-on role where debugging, automation, and product thinking come together to define reliability as a core product capability. Your work will directly shape how enterprise customers experience stability, performance, and trust in the platform.
Responsibilities
- Partner with customers and internal teams to investigate and resolve complex production issues across Kubernetes-based on-prem and hybrid deployments.
- Lead deep root-cause analysis for escalations, reproduce issues, and collaborate with engineering teams to implement durable fixes.
- Build and maintain reliability tooling such as diagnostics systems, health checks, support bundles, and environment validation utilities.
- Own and improve test automation frameworks, focusing on CI stability, reducing flaky tests, and strengthening integration and end-to-end coverage.
- Define and maintain performance baselines, regression testing frameworks, and reliability gates to prevent production regressions.
- Improve installation, upgrade, and deployment reliability by identifying recurring failure patterns and building preventive solutions.
- Develop production-grade internal tools and product enhancements using Python, Go, or Rust to strengthen observability and system resilience.
- Establish a closed feedback loop from customer issues to engineering improvements in testing, observability, documentation, and defaults.
Requirements
- 4–7 years of experience in production engineering, SRE, platform engineering, or similar roles focused on reliability and distributed systems.
- Strong software engineering fundamentals, including debugging, testing, system design, and production-grade coding practices.
- Hands-on Kubernetes expertise, including troubleshooting workloads, networking, storage, RBAC, and multi-environment deployments.
- Strong experience with observability tools and techniques, including logs, metrics, and tracing for distributed system debugging.
- Proficiency in at least one programming language such as Python, Go, or Rust, with experience building internal tools or production systems.
- Strong analytical and communication skills, with the ability to break down complex incidents into clear root causes and actionable recommendations.
- Experience working in cross-functional environments with engineering, product, and customer-facing teams in fast-moving contexts.
- Self-directed and comfortable working in remote-first environments with shifting priorities driven by customer needs and escalations.
Benefits
- Competitive compensation package aligned with experience and seniority
- Fully remote work environment across Canada and the United States
- Opportunity to work on real-world production infrastructure used in complex enterprise environments
- Strong technical ownership with high impact on product reliability and customer experience
- Collaboration with experienced engineers in infrastructure, automation, and platform engineering
- Learning and growth opportunities in Kubernetes, observability, and large-scale distributed systems
- Inclusive and diverse team culture focused on collaboration and continuous improvement
- Exposure to open-source-driven infrastructure innovation