
Staff Software Engineer - Grafana Databases, Managed Services at Jobgether – UK

Jobgether
UK
Job Function: Information Technology


About This Position

Staff Software Engineer - Grafana Databases, Managed Services

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Staff Software Engineer – Grafana Databases, Managed Services in the United Kingdom.

In this role, you will operate at the intersection of large-scale distributed systems, streaming infrastructure, and cloud database platforms, helping power mission-critical observability services used globally. You will be responsible for the reliability, scalability, and performance of multi-cloud infrastructure that underpins high-throughput metrics, logs, and traces systems. Working in a deeply technical, remote-first engineering environment, you will influence architecture decisions while remaining hands-on in production systems. Your work will directly impact the stability and efficiency of large-scale data pipelines operating across hundreds of clusters. This is a high-autonomy role where you will partner with platform and database teams to solve complex distributed systems challenges. You will also play a key role in shaping operational excellence, reliability practices, and long-term system evolution across global infrastructure.

Accountabilities

In this role, you will take ownership of large-scale streaming and database infrastructure, ensuring reliability, scalability, and performance across hundreds of production clusters while driving architectural improvements and operational excellence.

  • Operate and evolve large-scale multi-cloud streaming and database infrastructure across production environments
  • Diagnose and resolve complex cross-layer failures involving storage, compute, networking, and control-plane systems
  • Design and implement safe rollout, upgrade, and migration strategies across distributed systems at scale
  • Improve observability, automation, and operational tooling to reduce system toil and increase reliability
  • Define and evolve SLOs, error budgets, and reliability standards for shared infrastructure systems
  • Partner with engineering teams to optimize query performance, data partitioning, and system scalability
  • Serve as a primary escalation point for high-severity incidents and lead deep root cause analysis efforts
  • Drive long-term architectural improvements to reduce systemic risks across multi-cluster environments
  • Mentor engineers and contribute to best practices in distributed systems engineering and operational excellence

Requirements

You bring deep expertise in distributed systems, infrastructure engineering, or platform engineering, with strong experience operating high-scale production systems in cloud environments. You are highly technical, autonomous, and comfortable leading complex initiatives across global teams.

  • 8+ years of software engineering experience in SRE, platform engineering, infrastructure, or distributed systems roles
  • Strong experience with large-scale streaming or database systems (e.g., Kafka, Redpanda, ClickHouse, Cassandra, or similar)
  • Hands-on expertise with Kubernetes in AWS, GCP, or Azure environments
  • Proficiency in infrastructure-as-code tools such as Terraform, Helm, or similar
  • Strong programming skills in systems-oriented languages (Go preferred)
  • Deep understanding of distributed systems behavior, failure modes, and performance trade-offs
  • Experience with observability, incident response, and writing post-incident reviews
  • Strong knowledge of Linux internals, networking, storage systems, and cloud architecture
  • Proven ability to lead technical initiatives and influence architectural decisions without formal authority
  • Excellent communication skills with the ability to work effectively in remote, cross-functional teams

Benefits

  • Competitive compensation package including base salary, bonus (where applicable), and equity (RSUs)
  • Fully remote-first working model with global collaboration across distributed teams
  • 30 days annual leave, including designated shutdown days for full disconnection
  • Equity ownership in the company’s long-term success through RSU participation
  • Access to modern AI development tools with company-supported usage budgets
  • Strong emphasis on autonomy, trust, and outcome-driven engineering culture
  • Career growth opportunities in a fast-scaling global infrastructure organization
  • Exposure to cutting-edge distributed systems and large-scale observability platforms
  • Inclusive, transparent, and highly collaborative engineering environment
How Jobgether works:
We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.
We appreciate your interest and wish you the best!
Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.

Job Location

UK
