Senior Site Reliability Engineer, Wikimedia Enterprise in Czechia at Jobgether
Job Description
This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior Site Reliability Engineer, Wikimedia Enterprise in Czechia.
This role sits at the intersection of large-scale infrastructure engineering and mission-driven technology powering global knowledge distribution systems. You will help design, operate, and evolve highly available, high-performance API and data infrastructure that supports large-scale reuse of Wikimedia content worldwide. The position involves deep technical ownership of reliability, scalability, and observability for critical services. You will work in a fully distributed, globally collaborative environment alongside experienced SREs, software engineers, and platform teams. The role combines hands-on engineering, incident response, and long-term reliability strategy. It also offers the opportunity to contribute to systems that directly impact how knowledge is accessed and reused across the internet. You will operate in a fast-paced, product-focused engineering culture with strong emphasis on automation, experimentation, and continuous improvement.
In this role, you will be responsible for ensuring the reliability, scalability, and performance of large-scale distributed systems that power data and API services. You will:
- Define, track, and continuously improve SLOs, SLIs, and error budgets for critical services
- Design and enhance observability systems including metrics, logging, and distributed tracing
- Participate in incident response, on-call rotations, and post-incident reviews to drive continuous improvement
- Build and maintain CI/CD and GitOps pipelines enabling secure, automated, and reliable deployments
- Implement infrastructure-as-code and automation-first practices to reduce operational toil
- Design and operate scalable cloud infrastructure across production environments
- Drive capacity planning, performance optimization, and resilience testing (including chaos engineering practices)
- Improve developer experience by enabling self-service infrastructure and streamlined workflows
- Collaborate with security, software, and release engineering teams to embed reliability and security best practices
- Optimize infrastructure cost and efficiency using FinOps principles without compromising availability
- Develop and maintain operational metrics such as MTTR, MTTD, and incident frequency
- Contribute to platform engineering initiatives that standardize infrastructure across teams
- Mentor peers and promote best practices in SRE, automation, and systems reliability
This position requires strong expertise in site reliability engineering, distributed systems, and cloud infrastructure, along with a proactive and collaborative mindset. You should have:
- 5+ years of experience in SRE, DevOps, or infrastructure engineering roles
- Strong experience with infrastructure-as-code tools such as Terraform and/or Ansible
- Proficiency in at least one programming language (Python, Go, or similar)
- Hands-on experience with cloud platforms such as AWS, GCP, or Azure
- Experience building and maintaining CI/CD pipelines and GitOps workflows (e.g., GitLab, ArgoCD)
- Strong understanding of SRE principles including SLOs, SLIs, and error budgets
- Experience with observability tooling such as Prometheus, OpenTelemetry, or equivalent
- Proven experience in incident response, on-call operations, and postmortem analysis
- Ability to operate and optimize large-scale distributed systems with high availability requirements
- Strong communication and collaboration skills in distributed, remote-first environments
- Ability to document systems clearly and contribute to shared engineering knowledge
- Strong ownership mindset, with a focus on automation, reliability, and continuous improvement
- Adaptability to fast-evolving, technology-driven environments
This position offers:
- Remote-first work model with global collaboration
- Opportunity to work on high-impact systems supporting global knowledge platforms
- Exposure to large-scale distributed systems and modern cloud-native architectures
- Culture of engineering excellence, automation, and continuous improvement
- Strong emphasis on learning, experimentation, and open collaboration
- Competitive compensation adjusted to location and experience
- Inclusive and diverse work environment with global team exposure
- Opportunity to contribute to open knowledge infrastructure used worldwide