Lead Observability Engineer at Kobie Marketing – Bengaluru, Karnātaka
About This Position
About the Team and What We’ll Build Together
You are a Lead Observability Engineer who will drive the strategy, adoption, and evolution of observability across all production and delivery environments. You will play a critical role in ensuring system reliability, performance visibility, and proactive issue resolution across our platforms.
You will operate at the intersection of Engineering, DevOps, and Production Support, bringing structure, standardization, and intelligence to how we monitor and manage systems. You will lead the shift from reactive operations to proactive, AI-driven observability and automated reliability.
In this role, you will:
- Own and evolve the observability platform (e.g., New Relic) to provide end-to-end visibility across applications and infrastructure
- Establish standards for monitoring, alerting, dashboards, and telemetry (logs, metrics, traces)
- Leverage AIOps capabilities to improve anomaly detection, reduce noise, and accelerate root cause analysis
- Drive automation and self-healing workflows to minimize manual intervention and improve system resilience
- Collaborate across teams to ensure systems are observable by design and aligned with reliability goals
- Continuously analyze system behavior and incident patterns to improve performance, scalability, and uptime
You will be part of a team focused on building a highly reliable, data-driven, and scalable operational ecosystem, where observability is a core foundation for engineering excellence.
Key Responsibilities
· Lead the observability strategy and execution, ensuring comprehensive visibility across all production and delivery environments.
· Own and govern the enterprise observability platform (New Relic or equivalent tools such as Datadog or Dynatrace) and ensure consistent monitoring standards across systems.
· Explore and adopt AI-driven monitoring capabilities (AIOps) to automate anomaly detection, reduce alert fatigue, and enable predictive problem management.
· Collaborate closely with Production Support (L1/L2), DevOps, CloudOps, Software Engineering, and Database teams to triage complex production issues and accelerate incident resolution.
· Act as the operational coordinator during service-impacting events, organizing workflows, managing cross-team dependencies, and providing structured updates to leadership.
· Design and implement automated remediation workflows and self-healing mechanisms for recurring incidents.
· Analyze telemetry data (logs, metrics, traces) to identify incident patterns and systemic anomalies, and continuously refine alert thresholds and routing logic.
· Develop and maintain dynamic dashboards that reflect real-time system health, application performance, and infrastructure behavior.
· Define and track reliability metrics such as SLOs, SLIs, MTTD, and MTTR to improve service reliability.
· Ensure clear, timely communication with stakeholders during incidents and operational events.
· Drive organization-wide adoption of observability best practices through documentation, training, and knowledge sharing.
Qualifications
· 8–10+ years of experience in observability, site reliability engineering (SRE), DevOps, or advanced production operations in large-scale enterprise environments.
· Expert-level hands-on experience implementing and optimizing observability platforms such as New Relic, Datadog, Dynatrace, or Splunk.
· Strong understanding of monitoring fundamentals including logs, metrics, traces, and alerting strategies.
· Experience working with cloud-native architectures (AWS preferred).
· Familiarity with containerized environments and orchestration platforms such as Kubernetes.
· Experience integrating observability practices into CI/CD pipelines to ensure applications are observable by design.
· Strong understanding of incident management, problem management, and change management practices (ITIL concepts).
· Demonstrated ability to analyze telemetry data to identify patterns, detect anomalies, and improve operational reliability.
· Strong leadership and collaboration skills with the ability to coordinate across engineering, DevOps, and operations teams.
· Excellent communication skills and a strong focus on operational excellence and continuous improvement.
Nice to Have
· Experience implementing AI/ML capabilities within observability tools for anomaly detection and predictive monitoring.
· Familiarity with AIOps platforms and automated remediation workflows.
· Experience with event streaming platforms such as Kafka for telemetry ingestion or real-time data processing.
· Basic understanding of application architecture and experience troubleshooting distributed systems.
· Experience with automation frameworks or serverless workflows (e.g., AWS Lambda, scripting, or infrastructure automation).