Case Study

Event streaming observability

A streaming analytics team running Kafka and Flink struggled with noisy alerts and unclear ownership. I delivered an observability revamp that tightened SLOs and cut MTTR.

Highlights

  • 60% reduction in alert noise with SLO-based alerting and runbooks.
  • MTTR under 15 minutes through better telemetry and on-call handoffs.
  • Unified logging pipeline for Kafka/Flink workloads into OpenSearch with retention tiers.

What I built

  • Prometheus + Grafana stack with Kafka exporter, JVM dashboards, and golden signals per service.
  • SLO definitions with burn-rate alerts routed by service ownership, backed by PagerDuty schedules.
  • Structured logging pattern and ingestion into OpenSearch with lifecycle policies.
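
To illustrate the burn-rate alerting idea, here is a minimal Python sketch of a multi-window check. It is not the alert rules used in this project: the function names, the 14.4 threshold, and the long/short window pairing are assumptions borrowed from common SRE practice, and in a Prometheus setup the same logic would normally live in alerting rules rather than application code.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how fast the error budget is being consumed.

    A burn rate of 1.0 means the budget would last exactly the SLO period;
    higher values mean it runs out proportionally sooner.
    """
    budget = 1.0 - slo_target  # allowed fraction of failed requests
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0 to leave an error budget")
    return error_ratio / budget


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float,
    threshold: float = 14.4,  # assumed value, tune per service
) -> bool:
    """Page only when both windows exceed the burn-rate threshold.

    The long window (for example 1 hour) shows the problem is significant;
    the short window (for example 5 minutes) shows it is still happening.
    """
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )
```

Requiring both windows to breach is what reduces noise: a short error spike that has already recovered fails the short-window check, so nobody is paged for it.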

Impact

  • Engineers focused on real incidents instead of triaging false positives.
  • Clear dashboards improved incident comms with product and data stakeholders.
  • Compliance teams gained retention/PII controls through structured logging.
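
As a rough illustration of how structured logging can support PII controls, the Python sketch below emits one JSON object per log line and redacts sensitive fields before they leave the process. The field names, the `PII_FIELDS` set, and the `JsonFormatter` class are hypothetical and are not taken from the project's actual logging pattern.

```python
import json
import logging

# Hypothetical set of field names treated as PII; a real list would come
# from the compliance team.
PII_FIELDS = frozenset({"email", "ip_address"})


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object with PII redacted."""

    def format(self, record: logging.LogRecord) -> str:
        # Extra context is passed via `extra={"service": ..., "fields": {...}}`.
        fields = getattr(record, "fields", {})
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "service": getattr(record, "service", "unknown"),
            "message": record.getMessage(),
            "fields": {
                key: "[REDACTED]" if key in PII_FIELDS else value
                for key, value in fields.items()
            },
        }
        return json.dumps(payload)


def build_logger(name: str) -> logging.Logger:
    """Create a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A call such as `build_logger("enricher").info("lag above target", extra={"service": "enricher", "fields": {"topic": "orders", "email": "a@example.com"}})` would keep `topic` searchable while replacing the email with a placeholder. Because every line shares the same keys, index mappings and retention tiers can be applied per field or per service downstream.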