Case Study
Event streaming observability
A streaming analytics team running Kafka and Flink struggled with noisy alerts and unclear ownership. I delivered an observability revamp that tightened SLOs and cut MTTR.
Highlights
- 60% reduction in alert noise with SLO-based alerting and runbooks.
- MTTR under 15 minutes through better telemetry and on-call handoffs.
- Unified logging pipeline that ships Kafka and Flink workload logs into OpenSearch with retention tiers.
What I built
- Prometheus + Grafana stack with Kafka exporter, JVM dashboards, and golden signals per service.
- SLO definitions with burn-rate alerts routed by service ownership, backed by PagerDuty schedules.
- Structured logging pattern and ingestion into OpenSearch with lifecycle policies.
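
As an illustration of the burn-rate alerting described above, here is a minimal sketch of a multi-window check. It is not the production rule set: the thresholds (14.4 over 1h/5m, 6 over 6h/30m) are the commonly cited values for a 30-day SLO window, and the service names, team names, and the `OWNERS` map are hypothetical placeholders for whatever ownership data feeds the paging schedules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    long_window: str   # e.g. "1h"
    short_window: str  # e.g. "5m"
    threshold: float   # burn-rate multiple that triggers the rule
    severity: str      # "page" or "ticket"


# Assumed thresholds for a 30-day SLO window (illustrative, not from the project).
RULES = [
    BurnRateRule("1h", "5m", 14.4, "page"),
    BurnRateRule("6h", "30m", 6.0, "page"),
]

# Hypothetical ownership map: service name -> owning on-call team.
OWNERS = {
    "kafka-ingest": "streaming-platform",
    "flink-enrichment": "data-pipelines",
}


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_fire(rule: BurnRateRule, long_ratio: float, short_ratio: float,
                slo_target: float) -> bool:
    """Fire only when both windows exceed the threshold.

    The short window confirms the problem is still happening, which is
    what keeps stale or already-recovered spikes from paging anyone.
    """
    return (burn_rate(long_ratio, slo_target) >= rule.threshold
            and burn_rate(short_ratio, slo_target) >= rule.threshold)


def route(service: str) -> str:
    """Return the owning team for a service, with a catch-all fallback."""
    return OWNERS.get(service, "unowned-triage")
```

With a 99.9% SLO, a sustained 2% error ratio in both windows is roughly a 20x burn rate and trips the first rule, while a spike that has already recovered in the short window does not.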
Impact
- Engineers focused on real incidents instead of triaging false positives.
- Clear dashboards improved incident comms with product and data stakeholders.
- Compliance teams gained retention/PII controls through structured logging.
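
To make the structured logging and PII controls more concrete, the sketch below shows one way a JSON log formatter could redact sensitive fields and tag each record with a retention tier before it is shipped to OpenSearch. The field names, the `PII_FIELDS` set, and the `retention_tier` values are assumptions for illustration, not the schema used in this project.

```python
import json
import logging

# Hypothetical set of context keys treated as PII.
PII_FIELDS = frozenset({"email", "user_id", "ip"})


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with PII values redacted."""

    def format(self, record: logging.LogRecord) -> str:
        # Extra attributes arrive via logger.info(..., extra={...}).
        context = getattr(record, "context", {})
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "service": getattr(record, "service", "unknown"),
            # Assumed tier label that an index lifecycle policy could key on.
            "retention_tier": getattr(record, "retention_tier", "standard"),
            "message": record.getMessage(),
            "context": {
                key: "[REDACTED]" if key in PII_FIELDS else value
                for key, value in context.items()
            },
        }
        return json.dumps(payload, sort_keys=True)


def build_logger(name: str) -> logging.Logger:
    """Create a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A call such as `build_logger("flink-job").info("checkpoint complete", extra={"service": "flink-enrichment", "context": {"email": "a@example.com", "topic": "orders"}})` would emit a single JSON line in which `email` is replaced by `[REDACTED]` and `topic` is kept as-is.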