Case study

Event streaming observability

A streaming analytics team running Kafka and Flink struggled with noisy alerts and unclear alert ownership. I delivered an observability revamp that moved alerting onto SLOs, routed pages to the owning team, and brought MTTR under 15 minutes.

Kafka, Prometheus, Grafana, OpenSearch

01

Highlights

60% reduction in alert noise with SLO-based alerting and runbooks.
MTTR under 15 minutes through better telemetry and on-call handoffs.
Unified logging pipeline for Kafka/Flink workloads into OpenSearch with retention tiers.

02

What I built

Prometheus + Grafana stack with Kafka exporter, JVM dashboards, and golden signals per service.
SLO definitions with burn-rate alerts routed by service ownership, backed by PagerDuty schedules.
Structured logging pattern and ingestion into OpenSearch with lifecycle policies.
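
To illustrate the burn-rate alerting mentioned above, here is a minimal Python sketch of the underlying arithmetic. It is not the project's actual alert rules: the `BurnWindow` type, the function names, and the 14.4 threshold (a conventional value for a one-hour window against a 30-day SLO) are assumptions chosen for illustration. In practice this logic would normally live in Prometheus alerting rules rather than application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnWindow:
    """Error ratios observed over a long and a short lookback window (hypothetical type)."""
    long_error_ratio: float   # failed / total requests over the long window
    short_error_ratio: float  # same ratio over the short window
    threshold: float          # burn-rate multiple at which to page (assumed value)


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how many times faster than allowed the error budget is being spent."""
    budget = 1.0 - slo_target  # e.g. a 99.9% SLO leaves a 0.1% error budget
    return error_ratio / budget


def should_page(window: BurnWindow, slo_target: float) -> bool:
    """Page only when both windows exceed the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, which suppresses alerts for already-recovered blips.
    """
    return (
        burn_rate(window.long_error_ratio, slo_target) >= window.threshold
        and burn_rate(window.short_error_ratio, slo_target) >= window.threshold
    )


if __name__ == "__main__":
    sustained = BurnWindow(long_error_ratio=0.02, short_error_ratio=0.02, threshold=14.4)
    recovered = BurnWindow(long_error_ratio=0.02, short_error_ratio=0.001, threshold=14.4)
    print(should_page(sustained, slo_target=0.999))
    print(should_page(recovered, slo_target=0.999))
```

Requiring both windows to breach is what reduces noise: a spike that has already subsided fails the short-window check and never reaches the on-call engineer.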

03

Impact

Engineers focused on real incidents instead of triaging false positives.
Clear dashboards improved incident comms with product and data stakeholders.
Compliance teams gained retention/PII controls through structured logging.
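
As a rough illustration of how structured logging can support retention and PII controls, the sketch below emits JSON log lines with sensitive fields masked before they leave the service. The field names in `PII_FIELDS`, the `retention_tier` attribute, and its `"hot"` default are hypothetical, not the schema used in this project.

```python
import json
import logging

# Hypothetical set of context keys treated as PII; a real list would come from compliance.
PII_FIELDS = {"email", "user_id"}


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with PII values masked."""

    def format(self, record: logging.LogRecord) -> str:
        # Callers attach structured fields via `extra={"context": {...}}`.
        context = getattr(record, "context", {})
        redacted = {
            key: "[REDACTED]" if key in PII_FIELDS else value
            for key, value in context.items()
        }
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
            # Assumed field that an index lifecycle policy could route on.
            "retention_tier": getattr(record, "retention_tier", "hot"),
            "context": redacted,
        }
        return json.dumps(payload)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("flink-enricher")
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    log.info(
        "checkpoint complete",
        extra={"context": {"topic": "orders", "email": "a@example.com"}},
    )
```

Because every line is a flat JSON object with a consistent shape, downstream ingestion can index fields directly and apply retention rules per field value instead of parsing free text.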

Next step

Want this kind of clarity in your platform?
