Over the years, observability has grown in ways I couldn’t have imagined when I first started working in this space. Thanks to OpenTelemetry, we now have a standard way to collect traces, metrics, and logs. Tools like Grafana, Prometheus, Jaeger, and Elasticsearch make it easy to store and visualize that data.
But here’s the truth I keep coming back to:
Even with all the dashboards and alerts, something is still missing.
When incidents happen, engineers (including myself) often spend hours switching between tools, trying to connect dots, and figuring out why something broke. The last mile — turning raw signals into real understanding — isn’t solved yet.
How AI Changes the Game
This is where I see a big opportunity. With the rise of LLMs and AI, we can move beyond just showing data and start delivering narratives and explanations. Imagine if, instead of just throwing alerts at us, the system could:
- Explain anomalies in plain language
- Suggest likely root causes by looking across logs, metrics, and traces
- Let us query observability data directly through natural language
That’s how we turn observability data into operational intelligence.
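Here is a minimal sketch of what I mean, assuming the relevant signals have already been gathered. Everything in it is illustrative: `call_llm` is a stand-in for whichever model API you actually use (Bedrock, OpenAI, a local model), and the payload shapes are made up.

```python
import json

def build_explanation_prompt(alert: dict, metrics: list, logs: list, traces: list) -> str:
    """Assemble one prompt that puts an alert and its surrounding signals in context."""
    return (
        "You are an SRE assistant. Explain the following anomaly in plain language, "
        "suggest the most likely root cause, and list the evidence you used.\n\n"
        f"Alert:\n{json.dumps(alert, indent=2)}\n\n"
        f"Related metrics:\n{json.dumps(metrics, indent=2)}\n\n"
        f"Related traces:\n{json.dumps(traces, indent=2)}\n\n"
        f"Related logs:\n{json.dumps(logs, indent=2)}\n"
    )

# Hypothetical usage: `call_llm` is whatever model client you run; swap in the real call.
def explain_incident(alert, metrics, logs, traces, call_llm):
    prompt = build_explanation_prompt(alert, metrics, logs, traces)
    return call_llm(prompt)  # returns a plain-language narrative, not another chart
```

The point isn’t the prompt itself; it’s that the system hands the model correlated context instead of making a human assemble it at 3 a.m.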
The Challenge with Observability Data
Observability data naturally lives in different systems, each with its own query API (the short sketch after this list shows what that looks like in practice):
- Metrics in Prometheus
- Traces in Jaeger or Tempo
- Logs in Elastic or Loki
- Cloud-native data in CloudWatch, Stackdriver, or Datadog
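To make the fragmentation concrete, here is roughly what a single investigation looks like today: three different APIs, three query dialects, three response shapes. The hostnames, ports, and queries below are illustrative placeholders, not a recommendation.

```python
import requests

# Metrics: PromQL against the Prometheus HTTP API
metrics = requests.get(
    "http://prometheus:9090/api/v1/query",
    params={"query": 'rate(http_requests_total{status="500"}[5m])'},
).json()

# Traces: recent traces for a service from the Jaeger query API
traces = requests.get(
    "http://jaeger:16686/api/traces",
    params={"service": "checkout", "limit": 20},
).json()

# Logs: full-text search against Elasticsearch
logs = requests.post(
    "http://elasticsearch:9200/logs-*/_search",
    json={"query": {"match": {"message": "timeout"}}},
).json()

# Three silos, three query dialects, three response formats; correlation is left to you.
```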
Some modern platforms, like OpenHouse (built on ClickHouse), tackle this by providing specialized schemas for each type of data — one schema optimized for metrics, another for traces, another for logs. This approach allows all signals to be stored in a single backend while still respecting the differences between data types.
That works well if you’re ready to migrate all your observability data into a platform like OpenHouse. But migration isn’t always practical — especially if you already have Prometheus, Jaeger, and Elastic running in production.
This is where connectors come in. Instead of forcing everything into a single database, connectors let each data type stay in the system where it performs best, while still giving you:
- A unified way to query and correlate across tools
- The ability to reuse existing investments without migration
- Flexibility to evolve the stack without vendor lock-in
In other words, you don’t need “one database to rule them all.” With the right connectors, you can keep best-of-breed tools for each signal and still break down the silos when it matters most.
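To make that less abstract, here is a minimal sketch of what such a connector contract could look like: one normalized record type plus a single time-window query method. The names (`Signal`, `Connector`, `PrometheusConnector`) are my own placeholders, not an existing library.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Protocol

@dataclass
class Signal:
    """A normalized record so metrics, traces, and logs line up on one timeline."""
    source: str        # e.g. "prometheus", "jaeger", "loki"
    kind: str          # "metric" | "trace" | "log"
    timestamp: datetime
    attributes: dict   # service, severity, trace_id, label sets, ...
    body: str          # the raw sample, span, or log line

class Connector(Protocol):
    """Each backend stays where it is; it only has to answer time-window queries."""
    def query(self, start: datetime, end: datetime, filters: dict) -> Iterable[Signal]: ...

class PrometheusConnector:
    def __init__(self, base_url: str) -> None:
        self.base_url = base_url

    def query(self, start: datetime, end: datetime, filters: dict) -> Iterable[Signal]:
        # Translate filters into PromQL, call the range-query API, and map the
        # results back into Signal records. Omitted here to keep the sketch short.
        raise NotImplementedError
```

Jaeger, Loki, CloudWatch, and the rest would each get a small adapter behind the same interface, which is what lets a correlation layer treat them as one stream.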
Breaking Down the Silos with Connectors
By building lightweight adapters that plug into these different tools, we can bring all that data into a single flow. Instead of treating each platform as an island, connectors act as bridges that unify the signals.
Why does this matter?
- It saves time during incidents — no more manual correlation across tabs and dashboards
- It makes root cause analysis easier by putting all signals in context
- It maximizes existing investments — you don’t need to replace tools, just integrate them
In short, connectors can do the heavy lifting. With the right ones in place, organizations can stop wrestling with fragmented data and focus on what really matters: understanding what went wrong and fixing it faster.
My Take on the Design
If I were to design it, I’d keep it simple with two main layers:
- Data Connectors Layer → lightweight adapters to bring in data from any observability source
- Insight Engine Layer → powered by AI/ML to analyze events, detect anomalies, and explain them in human terms
The whole thing should be modular, extensible, and open — just like the ecosystem around OpenTelemetry.
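As a rough sketch of how those two layers could fit together, assuming connectors shaped like the ones above and treating the detector and explainer as pluggable pieces (all names here are hypothetical):

```python
from datetime import datetime, timedelta

class InsightEngine:
    """Insight Engine layer: correlate signals from every connector, then detect and explain."""

    def __init__(self, connectors, detector, explainer):
        self.connectors = connectors   # Data Connectors layer: one adapter per backend
        self.detector = detector       # anomaly detection (statistical model, ML, ...)
        self.explainer = explainer     # LLM-backed step that turns findings into plain language

    def investigate(self, incident_time: datetime, filters: dict) -> str:
        start = incident_time - timedelta(minutes=15)
        end = incident_time + timedelta(minutes=5)
        signals = [
            s
            for connector in self.connectors
            for s in connector.query(start, end, filters)
        ]
        findings = self.detector(signals)          # which signals look anomalous, and how
        return self.explainer(signals, findings)   # a human-readable narrative with evidence
```

Because the detector and explainer are injected, you can swap an ML model or an LLM in and out without touching a single connector, which is exactly the kind of modularity the OpenTelemetry ecosystem has made normal.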
My Closing Thought
The next wave in observability isn’t about adding more charts. It’s about AI-powered insight and self-healing.
Connectors can unify data across silos, while AI can not only pinpoint the issue but also drive automated recovery.
If OpenTelemetry gave us a standard way to instrument systems, what we need now is an open insight and action layer that closes the loop — from detection, to root cause, to recovery.
Note: In my next post, Building AI for Observability with AWS Bedrock, I explore how AWS Bedrock’s modular architecture and AI-powered insight engine directly address the design challenges discussed here.