Site Observability Engineer in United States at Jobgether

Job Function: Engineering
Jobgether
United States, United States

Job Description

Site Observability Engineer

This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Site Observability Engineer based in the United States.

This role is central to ensuring engineering teams have full visibility into system health, performance, and reliability across complex distributed environments. The engineer will design and operate end-to-end observability platforms covering metrics, logs, traces, and events, enabling fast and accurate detection of issues before they impact users. The environment is highly technical, cloud-native, and closely aligned with SRE principles, with a strong emphasis on automation, scalability, and signal quality. The role involves shaping how telemetry is collected, stored, and transformed into actionable insight across the organization. It also requires close collaboration with platform, SRE, and product engineering teams to embed observability into every layer of the system. The position is ideal for someone passionate about reliability engineering, data-driven operations, and building systems that help others debug and improve production services.

Accountabilities

This role is responsible for building, operating, and evolving the organization’s observability ecosystem, ensuring engineers can effectively monitor, troubleshoot, and improve distributed systems at scale.

  • Design and operate enterprise-grade observability platforms across metrics, logs, traces, and events
  • Architect and manage tools such as Prometheus, Thanos, Mimir, Grafana, Loki, Tempo, OpenTelemetry, and Datadog
  • Define and enforce SLOs, SLIs, error budgets, and observability standards across teams
  • Build alerting frameworks integrated with on-call systems to reduce noise and improve incident response
  • Develop instrumentation standards including logging formats, metric naming, and trace propagation
  • Manage large-scale telemetry pipelines with a focus on performance, retention, and cost optimization
  • Build dashboards and self-service tools to improve observability adoption across engineering teams
  • Improve incident response readiness through better alerting, monitoring, and post-incident analysis
  • Partner with SRE and platform teams to embed observability into CI/CD and deployment workflows
  • Mentor engineers on observability best practices, debugging techniques, and reliability engineering principles

Requirements:

The ideal candidate brings deep experience in observability, SRE practices, and distributed systems, with strong technical and communication skills to drive adoption across engineering teams.

  • 5+ years of experience in SRE, platform engineering, or observability-focused roles
  • Strong hands-on expertise with Prometheus, Grafana, and at least one commercial tool (Datadog, New Relic, or Splunk)
  • Solid understanding of OpenTelemetry, distributed tracing, and structured logging
  • Proficiency in at least one programming language such as Go, Python, or Java
  • Experience operating high-scale metrics and log pipelines with high cardinality
  • Strong knowledge of SLOs, SLIs, error budgets, and reliability engineering principles
  • Experience integrating observability systems with CI/CD and incident management tools
  • Solid understanding of Linux systems, networking, and containerized environments
  • Strong troubleshooting, analytical, and communication skills
  • Experience in building or scaling observability platforms is highly valued

Benefits:
  • Competitive salary range ($100K–$150K based on experience)
  • 100% remote work within the United States
  • Full-time W2 employment structure (no C2C or 1099 arrangements)
  • Health, dental, and vision insurance options
  • Paid time off and company holidays
  • Retirement savings plan with employer contributions
  • Professional development and career growth opportunities
  • Exposure to modern cloud-native observability stacks and large-scale distributed systems
  • Collaborative engineering culture focused on reliability and continuous improvement

How Jobgether works:
We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.
We appreciate your interest and wish you the best!
Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.
