Senior Site Reliability Engineer in Canada Creek, Nova Scotia at Jobgether
Job Description
This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior Site Reliability Engineer in Canada.
This role sits at the core of a fast-scaling, AI-driven intelligence platform, where reliability is not just operational support but a strategic enabler of product innovation. You will design and own the foundations that ensure large-scale, mission-critical systems remain observable, resilient, and performant under demanding AI and data workloads. Acting as a senior individual contributor, you will shape reliability standards, SLO frameworks, and multi-region architecture while directly influencing engineering decisions across the organization. The environment is highly technical, collaborative, and innovation-focused, with a strong emphasis on AI-native systems and automation-first thinking. You will work across software, AI engineering, and platform teams to ensure seamless delivery of complex services. This is a hands-on leadership role for someone who wants to define how modern AI infrastructure operates at scale.
Responsibilities
- You will define and own service reliability standards, including SLOs, SLIs, and error budgets, ensuring consistent performance across all production systems.
- You will design and implement reliability patterns for AI agent pipelines, including observability, failure detection, and safe degradation mechanisms.
- You will architect and improve multi-region infrastructure strategies, driving high availability, disaster recovery readiness, and blast radius control.
- You will lead incident response and postmortem processes, ensuring durable fixes and continuous improvement of system resilience.
- You will serve as the primary reliability partner for engineering and AI teams, influencing architecture, deployment strategies, and system design decisions.
- You will own observability and platform tooling, including service catalog management, Datadog configuration, and AI workload monitoring.
- You will develop CI/CD standards and enable self-service developer platforms to improve deployment velocity and system reliability.
- You will contribute to FinOps initiatives by improving cost visibility and optimizing infrastructure efficiency across cloud environments.
Requirements
- You bring 6–8+ years of experience in Site Reliability Engineering, DevOps, or platform engineering, with senior-level technical ownership responsibilities.
- You have deep expertise in AWS and distributed systems architecture, including multi-region, high-availability environments.
- You are highly skilled in Kubernetes, Docker, Terraform, and GitOps practices, with strong infrastructure-as-code experience.
- You have hands-on experience with observability platforms such as Datadog, including SLO monitoring, alerting, tracing, and log analytics.
- You are proficient in scripting and development (Python and/or Bash), with a solid understanding of microservices architectures.
- You have strong experience designing and optimizing CI/CD pipelines (e.g., GitHub Actions, Bitbucket Pipelines).
- You understand reliability challenges in large-scale systems and can translate complex technical risks into actionable engineering solutions.
- You have strong communication and collaboration skills, with the ability to influence cross-functional teams and mentor engineers.
- Experience with AI/ML infrastructure, LLM systems, or agent-based architectures is a strong advantage.
Benefits
- Competitive compensation in the range of $125,200 – $132,500 CAD.
- Comprehensive benefits package including health, dental, vision, and wellness coverage.
- RRSP matching and annual fitness reimbursement.
- Flexible vacation policy and remote-first work arrangement within Canada.
- Access to professional training, development programs, and high-growth career opportunities.
- Wellness resources and employee support programs.
- Inclusive, diverse, and accessibility-focused work environment.
- Opportunities to work on cutting-edge AI and large-scale data infrastructure systems.