Observability Engineer (Prometheus / Grafana / Datadog) in United States at Jobgether
Job Description
This position is posted by Jobgether on behalf of a partner company. We are currently looking for an Observability Engineer (Prometheus / Grafana / Datadog) in the United States.
This role is focused on building and operating the observability backbone that enables engineering teams to understand and trust complex distributed systems. You will design and maintain end-to-end telemetry pipelines across metrics, logs, and traces, ensuring high-quality signals at scale. Working closely with SRE, platform, and product engineering teams, you will turn noisy system data into actionable insights that improve reliability and performance. The environment is highly technical and cloud-native, requiring strong experience across modern observability stacks and SRE practices. You will help define standards for instrumentation, alerting, and SLO-driven operations across the organization. This is a high-impact role where your work directly shapes how systems are monitored, debugged, and improved in production.
Responsibilities
- Design and operate large-scale observability platforms covering metrics, logs, traces, and synthetic monitoring using tools such as Prometheus, Grafana, Datadog, and OpenTelemetry, ensuring reliability, scalability, and usability across engineering teams.
- Define and enforce observability standards including instrumentation practices, metric naming conventions, structured logging, and distributed tracing approaches to ensure consistent telemetry quality.
- Build and maintain SLO/SLI frameworks, error budgets, and alerting systems that reduce noise while improving incident detection and operational response effectiveness.
- Manage high-volume time-series and log storage systems, optimizing for retention, performance, cost efficiency, and query reliability across distributed environments.
- Develop self-service tooling, dashboards, and reusable templates that enable product and platform teams to adopt observability best practices with minimal friction.
- Improve incident response workflows through better alerting, dashboards, runbooks, and post-incident analysis, while partnering closely with SRE and platform engineering teams.
Requirements
- 5+ years of experience in SRE, platform engineering, or observability-focused roles, with hands-on ownership of production monitoring systems at scale.
- Strong expertise with Prometheus, Grafana, and at least one commercial observability platform such as Datadog, New Relic, or Splunk in production environments.
- Deep understanding of OpenTelemetry, distributed tracing, structured logging, and modern telemetry pipelines across cloud-native architectures.
- Strong programming skills in at least one language such as Go, Python, or Java, with the ability to build automation and observability tooling.
- Solid knowledge of SRE principles including SLOs, error budgets, incident management, and reliability engineering practices.
- Experience operating Kubernetes or container-based environments, with strong Linux, networking, and distributed systems fundamentals.
- Strong communication skills with the ability to influence engineering teams and drive adoption of observability standards.
Benefits
- Competitive salary aligned with experience and market benchmarks
- Fully remote work across the United States
- Long-term, stable engagement with multi-year project scope
- Comprehensive healthcare coverage (medical, dental, and vision)
- Paid time off and standard leave benefits
- Opportunities to work with modern cloud-native and open-source observability technologies
- Career growth in a high-impact, platform-focused engineering environment