OpenTelemetry won observability mindshare, but it is entirely the wrong architectural choice: buy into its ethos and your code is held hostage by the least stable OTel instrumentation library among your dependencies.
Sadly, there was always an alternative that no one took: DTrace. Add USDT probes to your code, monitor it by instrumenting it externally, and send the resulting traces wherever you want. My sincere hope is that the renewed interest in eBPF makes this a reality soon: I never want to write another from opentelemetry.<whatever> import <whatever> ever again.
If some tracing plugin is shitting up your code with its monkeypatching, rip it out and instrument it yourself. We do this a lot. I’d say OTel packages are no better or worse quality-wise than any other stuff in node_modules. Not OTel’s fault that Code in general has Bugs and is Bad.
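"Instrument it yourself" can be very little code. A minimal sketch of manual instrumentation in place of monkeypatching: `traced` wraps any async call site you care about, and `recordSpan` is a hypothetical callback standing in for whatever emitter you already have (a log line, an OTLP exporter, your choice) — none of this is an OTel API.

```javascript
// Wrap a call site explicitly instead of letting a plugin patch it for you.
// recordSpan receives a plain object describing what happened.
async function traced(name, recordSpan, fn) {
  const start = Date.now();
  let failed = false;
  try {
    return await fn();
  } catch (err) {
    failed = true;
    throw err;
  } finally {
    recordSpan({ name, durationMs: Date.now() - start, ok: !failed });
  }
}
```

At the call site, `await traced("db.query", emit, () => client.query(sql))` replaces whatever the auto-instrumentation would have patched, and you control exactly what gets recorded.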
Does any OpenTelemetry vendor have a dashboard/graph product at the same level of usability as Datadog?
Honeycomb is decent at what it does, but its dashboard offerings are very limited.
Coming from Datadog, Grafana is such a bad experience that I want to cry every time I try to build out a service dashboard. There is so much more friction to get anything done: adding transform functions/operators, doing smoothing or extrapolation; even time shifting is like pulling teeth. Plus they just totally broke all our graphs with formulas for like two days.
Grafana is to Datadog what Bugzilla is to Linear.
dash0 is a brand-new player that aims to be "simple"; perhaps check it out. They're former colleagues of mine, so I'm biased, but they do know what they're doing - they built the APM tool Instana, which sold to IBM for $400M.
I was always put off from using more OTel by how verbose it is and how heavy the telemetry payloads are compared to simple ad-hoc alternatives.
Am I wrong?
The JavaScript Otel packages are implemented with 34 layers of extra abstraction. We wrote our own implementation of tracing for Cloudflare Workers and it performs much better with 0 layers of abstraction. I’ve seen a few other services switching over to our lightweight tracer. The emitted JSON is still chunky but removing all the incidental complexity helped a lot.
It is designed to give insight into dozen(s) of connected systems.
It will always be overkill for just an app or two talking with each other... till you grow, and then it won't be overkill anymore.
But it still might be worth getting into on smaller apps, just for the wealth of tools available.
No, the spec isn't great and makes it hard to implement a performant solution.
What the author doesn't realize is that OpenTelemetry has fundamental problems. I experienced this firsthand two years ago working with OTel in Rust, and just today I spent an entire afternoon debugging what turned out to be an OTel package update breaking react-router links. Since the bug showed up alongside several other package updates, OTel was at the bottom of my suspicion list.
The core issue is that, with otel, observability platforms become just a UI layer over a database. No one wants to invest in proper instrumentation, which is a difficult problem, so we end up with a tragedy of the commons where the instrumentation layer itself gets neglected as there is no money to be made there.
Most languages have a pretty mature ecosystem. I used it in Go and it was mostly problem-free, with the biggest annoyance being a bit of boilerplate that had to be added.
> The core issue is that, with otel, observability platforms become just a UI layer over a database. No one wants to invest in proper instrumentation, which is a difficult problem, so we end up with a tragedy of the commons where the instrumentation layer itself gets neglected as there is no money to be made there.
I don't think it's fair to say "no one wants to invest in proper instrumentation" - the OpenTelemetry community has built a massive amount of instrumentation in a relatively short period of time. Yes, OpenTelemetry is still young and unstable, but it's getting better every day.
As the article notes, the OpenTelemetry Collector has plugins that can convert nearly any telemetry format to OTLP and back. Many of the plugins are "official" and maintained by employees of Splunk, Datadog, Snowflake, etc. Not only does this break the lock-in, it also lets you reuse all the great instrumentation that's been built up over the years.
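As a sketch of what that looks like in practice, here is a hedged Collector pipeline that accepts the Datadog agent wire format and forwards it as OTLP. The `datadog` receiver lives in the contrib distribution, and the endpoints below are placeholders, not recommended values.

```yaml
receivers:
  datadog:                          # contrib receiver; speaks the Datadog agent protocol
    endpoint: 0.0.0.0:8126
exporters:
  otlp:
    endpoint: my-backend.example.com:4317   # any OTLP-capable backend
service:
  pipelines:
    traces:
      receivers: [datadog]
      exporters: [otlp]
```

Swapping the exporter is a config change, not a re-instrumentation project - which is the anti-lock-in argument in a nutshell.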
> The core issue is that, with otel, observability platforms become just a UI layer over a database.
I think this is a good thing - when everyone is on the same playing field (I can use Datadog instrumentation, convert it to OTel, then export it to Grafana Cloud/Prometheus), vendors will have to compete on performance and UX instead of their ability to lock us in with "golden handcuffs" instrumentation libraries.
> instrumentation layer itself gets neglected
It needs to be treated as an integral part of whatever framework is being instrumented. And maintained by those same people.
The trade-off is worth it in this case. Those are technical hurdles, and when the problem we are actually trying to solve is data sovereignty, those hurdles are just incidental complexity.
Of course you could also roll your own telemetry, which is generally not that difficult in a lot of frameworks. You don't always need something like OTel.
I would add that in most cases, it's just a web UI displaying a lot of noise disguised as data.
Making sense out of so much data is why Datadog and Sentry make so much money.
You still have to do that work yourself. I am using Honeycomb (the free tier), but their pricing makes little sense. Their margins must be something like 100x.
Having shipped free-tier observability products myself: saying you aren't paying them while guessing their margins are 100x is a perfect irony.
I'd just like to point out that you've said OTel has fundamental problems, and then you pointed out a couple examples of one-time-fixable transient problems.
These are issues you'd experience with anything that spans your stack as a custom telemetry library would.
There is very much an alternative. Looking at the execution of your code should never alter its fundamental performance the way otel is built to do. This was a solved problem at least a decade and a half ago, but the cool kids decided to reinvent the wheel, poorly.
https://news.ycombinator.com/item?id=45845889
It's more than a couple. The fundamental issue is not the bugs themselves (those are expected) but that, from my perspective, OTel is at odds with the observability business: these actors have little interest in contributing back to telemetry agents, since anyone can reap the rewards of that work. So instead they focus on their platforms, and the agents/libraries get neglected.
It's a great idea, in principle, but unless it gets strong backing from big tech, I think it'll fail. I'd love to be proven wrong.
> otel is at odds with the observability business because these actors have little interest to contribute back to telemetry agents since anyone can reap the rewards of that.
But all major vendors _do_ contribute to OTEL.
That's kind of how open source works, though. Of course the backend vendors won't care about anything that doesn't affect the backend somehow. But the people, i.e. users, who do want to be able to easily switch away from bad vendors, have incentives to keep things properly decoupled and working.
The license is the key enabler for all of this. The vendors can't be all that sneaky in the code they contribute without much higher risk of being caught. Sure, they will focus on the funnel that brings more data to them, but that leaves others more time to work on the other parts.