Observability Specialist in India at Jobgether
Job Description
This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for an Observability Specialist based in India.
This role focuses on building and scaling the observability backbone for complex, distributed enterprise platforms powering mission-critical ERP and AI-driven systems. You will design and implement end-to-end visibility solutions that ensure reliability, performance, and transparency across services, agents, and infrastructure. The position plays a key role in enabling engineering and operations teams to understand system behavior through logs, metrics, and distributed traces. You will establish modern observability standards using OpenTelemetry and industry-leading monitoring tools. This is a highly technical and foundational role where your work directly impacts production stability, incident resolution, and customer trust. You will collaborate closely with engineering, DevOps, and platform teams across global environments. The environment is fast-paced, engineering-led, and focused on building highly reliable enterprise-scale systems.
Responsibilities
- Design and implement scalable observability architecture using OpenTelemetry for distributed systems and AI-driven platforms.
- Build and maintain metrics, logging, and tracing infrastructure using tools such as Prometheus, Grafana, Jaeger, Loki, and related stacks.
- Define and enforce instrumentation standards across Java, Python, and web-based applications.
- Implement distributed tracing and context propagation across microservices, MCP workflows, and ERP system integrations.
- Develop dashboards, SLIs/SLOs, and alerting systems to monitor platform health, performance, and reliability.
- Create custom metrics and telemetry for AI agent behavior, LLM performance, and system-level insights.
- Design alerting strategies, escalation paths, and incident response workflows to reduce noise and improve reliability.
- Support root cause analysis and production troubleshooting using observability data and structured diagnostics.
Requirements
- Bachelor’s degree in Computer Science or a related technical field.
- 5+ years of experience in SRE, observability, or platform engineering roles in distributed systems environments.
- Strong hands-on expertise with OpenTelemetry, including metrics, logs, and tracing.
- Experience with monitoring and visualization tools such as Prometheus, Grafana, and alerting frameworks.
- Strong knowledge of distributed tracing tools such as Jaeger, Zipkin, or equivalent systems.
- Experience with log aggregation tools such as the ELK stack, Loki, or similar solutions.
- Proficiency in Python, Java, or Go for instrumentation and automation.
- Strong understanding of SLI/SLO frameworks, alerting strategies, and incident management practices.
- Familiarity with Kubernetes observability, service mesh telemetry, and cloud-native architectures is a plus.
- Exposure to AI/ML observability, LLM monitoring, or enterprise ERP systems is highly valued.
- Strong analytical, debugging, and communication skills with experience working in distributed global teams.
Benefits
- Opportunity to build core observability infrastructure for next-generation AI and ERP platforms.
- Exposure to large-scale distributed systems and enterprise-grade production environments.
- Work with modern observability and cloud-native technologies such as OpenTelemetry and the Grafana stack.
- Strong career growth in platform engineering, SRE, and AI systems reliability domains.
- Remote-friendly environment with collaboration across global engineering teams.
- Opportunity to influence reliability standards and observability practices at scale.
- Continuous learning in advanced monitoring, tracing, and AI system diagnostics.