Senior Site Reliability Engineer in Australia Fair, Queensland at Jobgether
Job Description
This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior Site Reliability Engineer in Australia.
This role sits at the core of a global cloud connectivity platform, ensuring that large-scale, distributed systems remain reliable, secure, and highly available. You will operate in a fast-paced, cloud-native environment where infrastructure resilience and automation are critical to business success. Working across international teams and time zones, you will help design, build, and maintain systems that support mission-critical services for global enterprises. This is a highly technical, hands-on engineering role that blends software development, infrastructure management, and operational excellence. You will play a key part in shaping SRE and DevOps practices while continuously improving system performance and reliability. The environment is collaborative, low-hierarchy, and innovation-driven, with a strong focus on customer impact.
Responsibilities
- Improve production reliability, scalability, and resilience across distributed systems within an SRE-focused environment.
- Design, implement, and maintain automation solutions to reduce operational toil and prevent recurring incidents.
- Participate in incident response, on-call rotations, and blameless post-incident reviews to continuously improve system stability.
- Build and maintain observability systems, including metrics, logs, and tracing, prioritizing actionable signal over noise.
- Develop and maintain infrastructure-as-code and CI/CD pipelines to support efficient and reliable deployments.
- Collaborate with cross-functional teams and stakeholders to gather requirements, share insights, and support technical decision-making.
- Write and maintain runbooks, tools, and automation scripts that support broader team operations.
- Contribute to system architecture decisions and help evolve platform engineering best practices across the organization.
- Work hands-on with Kubernetes-based infrastructure, cloud environments, and production systems in a highly distributed setup.
Requirements
- 5+ years of experience administering Linux-based production systems in complex, distributed environments.
- Strong background in Site Reliability Engineering principles, including SLIs, SLOs, error budgets, and postmortem practices.
- Hands-on experience with Kubernetes and containerized infrastructure at scale.
- Solid cloud infrastructure experience, with AWS strongly preferred and exposure to bare-metal environments considered a plus.
- Strong programming and automation skills in Bash and at least one of Python or Go.
- Experience with infrastructure-as-code tools such as Terraform.
- Proficiency with CI/CD pipelines and version control systems, preferably GitHub.
- Experience operating and optimizing observability stacks (metrics, logs, traces) with a focus on signal quality.
- Database experience with technologies such as Postgres, Cassandra, or ClickHouse is advantageous.
- Strong troubleshooting skills with experience supporting live production systems and leading incident response.
- Collaborative mindset with the ability to work effectively in asynchronous, globally distributed teams.
- Continuous learning mindset with a strong drive for technical growth and improvement.
Benefits
- Flexible working arrangements supporting remote and hybrid collaboration.
- Birthday leave as an additional paid day off.
- Generous learning and development budget with 5 days of paid study leave.
- Access to modern, creative, and collaborative work environments.
- Health and wellness programs designed to support employee wellbeing.
- Recognition programs including “Legend” and “Kudos” awards.
- Opportunity to work alongside highly skilled global engineering teams.
- Inclusive and supportive culture that values collaboration, curiosity, and innovation.