SRE/DevOps Engineer in Canada Creek, Nova Scotia at Jobgether
Job Description
This position is posted by Jobgether on behalf of a partner company. We are currently looking for an SRE/DevOps Engineer in Canada.
This role sits at the front line of enterprise platform reliability, ensuring the stability, availability, and performance of large-scale cloud and hybrid systems. You will act as the first line of response for incidents across modern infrastructure environments, including Kubernetes, APIs, databases, and cloud-native services. Working in a highly operational and collaborative setting, you will monitor systems, execute runbooks, and support rapid incident resolution to minimize downtime. The position combines hands-on technical troubleshooting with structured operational processes, where precision and communication are critical. You will contribute directly to service reliability by identifying issues, escalating intelligently, improving documentation, and spotting opportunities for automation. This is a high-impact role ideal for professionals who thrive in fast-paced, incident-driven environments and enjoy keeping complex systems running smoothly.
Responsibilities
- Monitor system health across cloud and on-prem environments using observability tools such as dashboards, logs, and alerting systems.
- Perform first-line incident triage, identify system anomalies, and execute standardized runbooks for resolution or escalation.
- Troubleshoot application and infrastructure issues across Kubernetes, APIs, databases, and cloud services to isolate root causes.
- Communicate incident status clearly and effectively to stakeholders, ensuring timely updates and accurate reporting.
- Support deployment operations and routine tasks by following predefined operational procedures and workflows.
- Document incidents, identify gaps in runbooks, and contribute to continuous improvement of operational knowledge bases.
- Assist in onboarding new applications into operational monitoring and support frameworks.
- Collaborate with engineering and L2/L3 teams to ensure smooth escalation and resolution of complex issues.
Requirements
- 2–5 years of experience in IT operations, NOC, SRE, or DevOps-related roles.
- Strong understanding of Linux, Kubernetes basics, and networking fundamentals.
- Experience working with observability tools such as Prometheus, Grafana, Splunk, ELK, or similar platforms.
- Ability to follow structured operational workflows, including runbooks and incident management procedures.
- Basic scripting knowledge in Python, Bash, or PowerShell for minor automation or script adjustments.
- Familiarity with cloud platforms such as AWS, Azure, or GCP is a strong plus.
- Understanding of troubleshooting techniques such as DNS checks, log analysis, connectivity checks, and the use of networking tools.
- Strong analytical and problem-solving mindset with a focus on incident resolution and root cause identification.
- Effective communication skills for incident reporting and stakeholder updates.
- Nice to have: exposure to ServiceNow, Jira, xMatters, SQL/NoSQL basics, or AI-assisted operational tools.
Benefits
- Competitive compensation aligned with experience and technical expertise.
- Flexible working arrangements depending on role and location.
- Comprehensive health and wellness support programs.
- Opportunities for continuous learning, upskilling, and career development.
- Exposure to large-scale cloud-native and enterprise systems.
- Inclusive and diverse work environment focused on collaboration and innovation.
- Strong emphasis on work-life balance and employee well-being.
- Access to modern tools, platforms, and automation-driven operations practices.