Over the years, observability has grown in ways I couldn’t have imagined when I first started working in this space. Thanks to OpenTelemetry, we now have a standard way to collect traces, metrics, and logs. Tools like Grafana, Prometheus, Jaeger, and Elasticsearch make it easy to store and visualize that data.
But here’s the truth I keep coming back to:
Even with all the dashboards and alerts, something is still missing.
When incidents happen, engineers (including myself) often spend hours switching between tools, trying to connect dots, and figuring out why something broke. The last mile — turning raw signals into real understanding — isn’t solved yet.
How AI Changes the Game
This is where I see a big opportunity. With the rise of LLMs and AI, we can move beyond just showing data and start delivering narratives and explanations. Imagine if, instead of just throwing alerts at us, the system could:
- Explain anomalies in plain language
- Suggest likely root causes by looking across logs, metrics, and traces
- Let us query observability data directly through natural language
That’s how we turn observability data into operational intelligence.
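Here is a minimal sketch of what I mean, assuming the relevant signals have already been gathered. Everything in it is illustrative: `call_llm` is a stand-in for whichever model API you actually use (Bedrock, OpenAI, a local model), and the payload shapes are made up.

```python
import json

def build_explanation_prompt(alert: dict, metrics: list, logs: list, traces: list) -> str:
    """Assemble one prompt that puts an alert and its surrounding signals in context."""
    return (
        "You are an SRE assistant. Explain the following anomaly in plain language, "
        "suggest the most likely root cause, and list the evidence you used.\n\n"
        f"Alert:\n{json.dumps(alert, indent=2)}\n\n"
        f"Related metrics:\n{json.dumps(metrics, indent=2)}\n\n"
        f"Related traces:\n{json.dumps(traces, indent=2)}\n\n"
        f"Related logs:\n{json.dumps(logs, indent=2)}\n"
    )

# Hypothetical usage: `call_llm` is whatever model client you run; swap in the real call.
def explain_incident(alert, metrics, logs, traces, call_llm):
    prompt = build_explanation_prompt(alert, metrics, logs, traces)
    return call_llm(prompt)  # returns a plain-language narrative, not another chart
```

The point isn’t the prompt itself; it’s that the system hands the model correlated context instead of making a human assemble it at 3 a.m.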
The Challenge with Observability Data
Observability data naturally lives in different systems, each with its own query API (the short sketch after this list shows what that looks like in practice):
- Metrics in Prometheus
- Traces in Jaeger or Tempo
- Logs in Elastic or Loki
- Cloud-native data in CloudWatch, Stackdriver, or Datadog
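To make the fragmentation concrete, here is roughly what a single investigation looks like today: three different APIs, three query dialects, three response shapes. The hostnames, ports, and queries below are illustrative placeholders, not a recommendation.

```python
import requests

# Metrics: PromQL against the Prometheus HTTP API
metrics = requests.get(
    "http://prometheus:9090/api/v1/query",
    params={"query": 'rate(http_requests_total{status="500"}[5m])'},
).json()

# Traces: recent traces for a service from the Jaeger query API
traces = requests.get(
    "http://jaeger:16686/api/traces",
    params={"service": "checkout", "limit": 20},
).json()

# Logs: full-text search against Elasticsearch
logs = requests.post(
    "http://elasticsearch:9200/logs-*/_search",
    json={"query": {"match": {"message": "timeout"}}},
).json()

# Three silos, three query dialects, three response formats; correlation is left to you.
```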
Some modern platforms, like OpenHouse (built on ClickHouse), tackle this by providing specialized schemas for each type of data — one schema optimized for metrics, another for traces, another for logs. This approach allows all signals to be stored in a single backend while still respecting the differences between data types.
That works well if you’re ready to migrate all your observability data into a platform like OpenHouse. But migration isn’t always practical — especially if you already have Prometheus, Jaeger, and Elastic running in production.
This is where connectors come in. Instead of forcing everything into a single database, connectors let each data type stay in the system where it performs best, while still giving you:
- A unified way to query and correlate across tools
- The ability to reuse existing investments without migration
- Flexibility to evolve the stack without vendor lock-in
In other words, you don’t need “one database to rule them all.” With the right connectors, you can keep best-of-breed tools for each signal and still break down the silos when it matters most.
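To make that less abstract, here is a minimal sketch of what such a connector contract could look like: one normalized record type plus a single time-window query method. The names (`Signal`, `Connector`, `PrometheusConnector`) are my own placeholders, not an existing library.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Protocol

@dataclass
class Signal:
    """A normalized record so metrics, traces, and logs line up on one timeline."""
    source: str        # e.g. "prometheus", "jaeger", "loki"
    kind: str          # "metric" | "trace" | "log"
    timestamp: datetime
    attributes: dict   # service, severity, trace_id, label sets, ...
    body: str          # the raw sample, span, or log line

class Connector(Protocol):
    """Each backend stays where it is; it only has to answer time-window queries."""
    def query(self, start: datetime, end: datetime, filters: dict) -> Iterable[Signal]: ...

class PrometheusConnector:
    def __init__(self, base_url: str) -> None:
        self.base_url = base_url

    def query(self, start: datetime, end: datetime, filters: dict) -> Iterable[Signal]:
        # Translate filters into PromQL, call the range-query API, and map the
        # results back into Signal records. Omitted here to keep the sketch short.
        raise NotImplementedError
```

Jaeger, Loki, CloudWatch, and the rest would each get a small adapter behind the same interface, which is what lets a correlation layer treat them as one stream.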
Breaking Down the Silos with Connectors
By building lightweight adapters that plug into these different tools, we can bring all that data into a single flow. Instead of treating each platform as an island, connectors act as bridges that unify the signals.
Why does this matter?
- It saves time during incidents — no more manual correlation across tabs and dashboards
- It makes root cause analysis easier by putting all signals in context
- It maximizes existing investments — you don’t need to replace tools, just integrate them
In short, connectors can do the heavy lifting. With the right ones in place, organizations can stop wrestling with fragmented data and focus on what really matters: understanding what went wrong and fixing it faster.
My Take on the Design
If I were to design it, I’d keep it simple with two main layers:
- Data Connectors Layer → lightweight adapters to bring in data from any observability source
- Insight Engine Layer → powered by AI/ML to analyze events, detect anomalies, and explain them in human terms
The whole thing should be modular, extensible, and open — just like the ecosystem around OpenTelemetry.
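As a rough sketch of how those two layers could fit together, assuming connectors shaped like the ones above and treating the detector and explainer as pluggable pieces (all names here are hypothetical):

```python
from datetime import datetime, timedelta

class InsightEngine:
    """Insight Engine layer: correlate signals from every connector, then detect and explain."""

    def __init__(self, connectors, detector, explainer):
        self.connectors = connectors   # Data Connectors layer: one adapter per backend
        self.detector = detector       # anomaly detection (statistical model, ML, ...)
        self.explainer = explainer     # LLM-backed step that turns findings into plain language

    def investigate(self, incident_time: datetime, filters: dict) -> str:
        start = incident_time - timedelta(minutes=15)
        end = incident_time + timedelta(minutes=5)
        signals = [
            s
            for connector in self.connectors
            for s in connector.query(start, end, filters)
        ]
        findings = self.detector(signals)          # which signals look anomalous, and how
        return self.explainer(signals, findings)   # a human-readable narrative with evidence
```

Because the detector and explainer are injected, you can swap an ML model or an LLM in and out without touching a single connector, which is exactly the kind of modularity the OpenTelemetry ecosystem has made normal.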
My Closing Thought
The next wave in observability isn’t about adding more charts. It’s about AI-powered insight and self-healing.
Connectors can unify data across silos, while AI can not only pinpoint the issue but also drive automated recovery.
If OpenTelemetry gave us a standard way to instrument systems, what we need now is an open insight and action layer that closes the loop — from detection, to root cause, to recovery.
Note: In my next post, Building AI for Observability with AWS Bedrock, I explore how AWS Bedrock’s modular architecture and AI-powered insight engine directly address the design challenges discussed here.