Senior Network Site Reliability Engineer (NetSRE) in UK at Jobgether
Job Description
This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior Network Site Reliability Engineer (NetSRE) in the United Kingdom.
Join a highly technical, fast-scaling environment focused on building next-generation AI cloud infrastructure at global scale. In this role, you will help ensure the reliability, scalability, and operational excellence of mission-critical network systems that power advanced AI workloads and distributed platforms. Working at the intersection of networking, automation, and site reliability engineering, you’ll collaborate closely with infrastructure and platform teams to design resilient systems and optimize operational performance. This opportunity is ideal for engineers who enjoy solving complex infrastructure challenges, driving automation, and improving reliability through engineering-first practices. You’ll contribute to large-scale networking operations while influencing tooling, observability, incident response, and deployment strategies. The environment values ownership, innovation, collaboration, and continuous improvement across globally distributed teams.
Responsibilities:
- Define and manage reliability objectives for critical network services, including SLIs, SLOs, availability targets, and operational performance standards.
- Lead initiatives to improve overall network reliability across infrastructure, inter-site connectivity, and operational workflows.
- Own incident response processes for networking environments, conduct root cause investigations, and implement long-term corrective solutions.
- Design and enhance observability systems through metrics, logging, tracing, alerting, and monitoring improvements to accelerate troubleshooting and recovery.
- Build and maintain automation, CI/CD pipelines, testing environments, rollback mechanisms, and safe deployment processes for network changes.
- Collaborate with platform engineering and infrastructure teams to improve operability, scalability, and reliability of networking systems.
- Develop tooling and automation solutions using modern programming languages and infrastructure management practices.
- Support operational readiness and scalability initiatives for high-availability and high-throughput networking environments.
Requirements:
- Strong experience in Site Reliability Engineering, Network Engineering, or Infrastructure Engineering roles within large-scale production environments.
- Solid Linux systems administration expertise and proven ability to troubleshoot complex distributed systems.
- Strong understanding of networking fundamentals, including failure domains, latency, packet loss, control plane/data plane concepts, and high-availability architectures.
- Hands-on experience operating and improving reliable production systems through automation and engineering best practices.
- Proficiency in software development or scripting using Go, Python, or similar programming languages.
- Experience with infrastructure-as-code, CI/CD pipelines, containerized environments, and operational automation tools.
- Familiarity with observability, telemetry, monitoring systems, and incident management practices.
- Ability to work collaboratively across engineering teams while maintaining strong ownership and communication skills.
- Additional experience with eBPF/XDP, DPDK, large-scale network telemetry, NAT64, load balancing, or advanced networking performance optimization is considered a strong plus.
Benefits:
- Competitive compensation package.
- Flexible remote work options across Europe.
- Career development and continuous learning opportunities.
- Collaborative and engineering-driven work environment.
- Opportunity to contribute to cutting-edge AI infrastructure projects.
- Exposure to international teams and large-scale distributed systems.
- High-impact role with strong ownership and technical influence.
- Supportive culture focused on innovation, growth, and work-life balance.