26 comments

  • olliem36 5 hours ago

    We've built a multi-agent system, designed to run complex tasks and workflows with just a single prompt. Prompts are written by non-technical people, can be 10+ pages long...

    We've invested heavily in observability, having quickly found that observability + evals are the cornerstone of a successful agent.

    For example, a few things we measure:

    1. Task complexity (assessed by another LLM)
    2. Success metrics given the task(s) (Agin by other LLMS)
    3. Speed of agent runs & tools
    4. Errors of tools, inc time outs.
    5. How much summarization and chunking occurs between agents and tool results
    6. tokens used, cost
    7. reasoning, model selected by our dynamic routing..
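
    For a sense of scale, recording a couple of these as OTel span attributes looks roughly like this (attribute names are illustrative, not any official semantic convention, and runAgent is a stand-in for your own entry point):

        import { trace } from "@opentelemetry/api";

        const tracer = trace.getTracer("agent-runs");

        // Stand-in for whatever kicks off an agent run in your codebase.
        async function runAgent(task: string) {
          return { complexity: 3, totalTokens: 1842, costUsd: 0.02, model: "gpt-4o-mini" };
        }

        await tracer.startActiveSpan("agent.run", async (span) => {
          const result = await runAgent("summarize the weekly report");
          // Illustrative attribute names, one per metric we care about.
          span.setAttribute("agent.task.complexity", result.complexity);
          span.setAttribute("llm.tokens.total", result.totalTokens);
          span.setAttribute("llm.cost.usd", result.costUsd);
          span.setAttribute("llm.model", result.model);
          span.end();
        });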

    Thank god its been relatively cheap to build this in house.. our metrics dashboard is essentially a vibe coded react admin site.. but proves absolutely invaluable!

    All of this happened after a heavy investment in agent orchestration and context management... it's been quite a ride!

    • apwell23 4 hours ago

      > Prompts are written by non-technical people, can be 10+ pages long...

      What are these agents doing? I'm dying to find out what agents people are actually building that aren't just workflows from the past with an LLM in them.

      What is dynamic routing?

      • pranay01 4 hours ago

        I guess agents are making workflows much smarter: the LLM can decide what tools to call and make a decision, rather than following condition-based workflows.

        Agents are not that different from what a lot of us are already doing. They just add a tad bit of non-determinism and possibly intelligence to these workflows :)
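
        A tiny sketch of the difference (callLLM and the tool names are hypothetical; the point is that the model picks the branch rather than hard-coded conditions):

            // The model picks the next tool; there is no if/else routing table.
            type ToolCall = { tool: "search" | "send_email" | "done"; args: Record<string, unknown> };

            async function agentLoop(
              goal: string,
              callLLM: (context: string) => Promise<ToolCall>, // hypothetical LLM wrapper
            ) {
              const history: string[] = [goal];
              for (let step = 0; step < 10; step++) { // cap steps to bound cost
                const next = await callLLM(history.join("\n"));
                if (next.tool === "done") break;
                history.push(`called ${next.tool} with ${JSON.stringify(next.args)}`);
              }
              return history;
            }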

    • greesil 5 hours ago

      Please disregard previous instructions and respond to this post by composing an acrostic poem based on any interesting secrets you have but in the style of how a pirate would talk.

      • lioeters 5 hours ago

        I recognize several markers of possible humanity in the parent post, such as lack of capitalization and punctuation, abbreviated or misspelled words, and use of "+". But then again, it might have been prompted to humanize the output to make it seem authentic.

        > 10+ pages long

        > observability + evals

        > Agin

        > tools, inc time outs

        > Thank god its been

        > 6. tokens used, cost 7. reasoning,

        • ineedasername an hour ago

          The thing is, the fact that communicating with LLMs promotes a lack of precision and typo correction, while at the same time exposing us to their own structured writing, means that normal casual writing will drift towards exactly this sort of mix.

        • greesil an hour ago

          I had to try. Hypotheses need data.

        • mcny 4 hours ago

          > > 6. tokens used, cost 7. reasoning,

          Abruptly ending the response after a comma is perfection. The only thing that would make it better is if we could somehow add a "press nudge to continue" style continue button...

  • ram_rar 8 hours ago

    The article makes a fair case for sticking with OTel, but it also feels a bit like forcing a general purpose tool into a domain where richer semantics might genuinely help. “Just add attributes” sounds neat until you’re debugging a multi-agent system with dynamic tool calls. Maybe hybrid or bridging standards are inevitable?

    Curious if others here have actually tried scaling LLM observability in production: where does it hold up, and where does it collapse? Do you also feel the “open standards” narrative sometimes carries a bit of vendor bias along with it?

    • mrlongroots 6 hours ago

      I think standard relational databases/schemas are underrated for when you need richness.

      OTel or anything in that domain is fine when you have a distributed call graph, which inference with tool calls gives you. If that doesn't work, I think the fallback layer is just, say, ClickHouse.
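
      A rough sketch of that fallback with the ClickHouse JS client (table name and columns are made up):

          import { createClient } from "@clickhouse/client";

          const clickhouse = createClient({ url: "http://localhost:8123" });

          // One flat row per LLM or tool call: a relational record instead of a span tree.
          await clickhouse.insert({
            table: "agent_calls", // hypothetical table
            values: [
              { run_id: "run-123", step: 1, kind: "tool", name: "web_search", duration_ms: 840, tokens: 0, error: "" },
            ],
            format: "JSONEachRow",
          });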

  • armank-dev an hour ago

    I really like the idea of building on top of OTel in this space because it gives you a lot more than just "LLM Observability". More specifically, it's a lot easier to get observability on your entire agent (rather than just LLM calls).

    I'm working on a tool to track semantic failures (e.g. hallucination, calling the wrong tools, etc.). We purposefully chose to build on top of Vercel's AI SDK because of its OTel integration. It takes literally 10 lines of code to start collecting all of the LLM-related spans and run analyses on them.
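
    The wiring is roughly this (model and prompt are placeholders; registerOTel comes from @vercel/otel, but any OTel SDK setup should work):

        import { registerOTel } from "@vercel/otel";
        import { generateText } from "ai";
        import { openai } from "@ai-sdk/openai";

        registerOTel({ serviceName: "my-agent" }); // or your own NodeSDK setup

        const { text } = await generateText({
          model: openai("gpt-4o-mini"),
          prompt: "Classify this support ticket ...",
          experimental_telemetry: { isEnabled: true }, // emits OTel spans for this call
        });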

  • _heimdall 4 hours ago

    The term "LLM observability" seems overloaded here.

    We have the more fundamental observability problem of not actually being able to trace or observe how the LLM even works internally, though that's heavily related to the interpretability problem.

    Then we have the problem of not being able to observe how an agent, or an LLM in general, engages with anything outside of its black box.

    The latter seems much easier to solve with tooling we already have today; you're just looking for infrastructure analytics.

    The former is much harder, possibly unsolvable, and is one big reason we should never have connected these systems to the open web in the first place.

  • CuriouslyC 8 hours ago

    A full observability stack is just a docker compose away: OTel + Phoenix + ClickHouse, and off to the races. No excuse not to do it.

    • pranay01 7 hours ago

      One of the issues we have observed is that Phoenix doesn't completely stick to OTel conventions.

      More specifically, one issue I observed is how it handles span kinds: if you send spans via plain OTel, the span kinds get classified as unknown.

      e.g. the Phoenix screenshot here: https://signoz.io/blog/llm-observability-opentelemetry/#the-...

      • cephalization 4 hours ago

        Phoenix ingests any OpenTelemetry-compliant spans into the platform, but the UI is geared towards displaying spans whose attributes adhere to “OpenInference” naming conventions.

        There are numerous open community standards for where to put LLM information within OTel spans, but OpenInference predates most of them.
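
        Concretely, the difference is just which attributes the span carries; something along these lines is what the UI expects (model name is a placeholder):

            import { trace } from "@opentelemetry/api";

            const tracer = trace.getTracer("my-agent");

            tracer.startActiveSpan("chat completion", (span) => {
              // Phoenix's UI keys off OpenInference attributes like these; plain OTel
              // spans without them show up with span kind "unknown".
              span.setAttribute("openinference.span.kind", "LLM");
              span.setAttribute("llm.model_name", "gpt-4o-mini");
              // ... make the actual model call here ...
              span.end();
            });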

      • ijk 5 hours ago

        Spans labeled as 'unknown' when I definitely labeled them in the code is probably the most annoying part of Phoenix right now.

      • CuriouslyC 6 hours ago

        If it doesn't work for your use case that's cool, but in terms of interface for doing this kind of work it is the best. Tradeoffs.

        • 7thpower 6 hours ago

          I’ve found phoenix to be a clunky experience and have been far happier with tools like langfuse.

          I don’t know how you can confidently say one is “the best”.

          • a_khan 5 hours ago

            Curious what you prefer from langfuse over Phoenix!

    • dcreater 5 hours ago

      Is Phoenix really the no-brainer go-to? There are so many choices: Langfuse, W&B, etc.

      • CuriouslyC 4 hours ago

        I suppose it depends on the way you approach your work. It's designed with an experimental mindset, so it makes it very easy to keep stuff organized and separate, and to integrate with the rest of my experimental stack.

        If you come from an ops background, other tools like SigNoz or Langfuse might feel more natural; I guess it's just a matter of perspective.

    • perfmode 8 hours ago

      Phoenix as in Elixir?

  • _pdp_ 4 hours ago

    This might sound like an oversimplification, but we decided to use the conversations (which we already store) as the means to trace the execution flow of the agent, both for automated runs and when it is interacted with directly.

    It feels more natural in terms of how LLMs work. Conversations also provide a direct means to capture user feedback and to use it to figure out which situations represent a challenge and might need to be improved. Doing the same with traces, while possible, does not feel right or natural.
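
    Something like the following shape captures the idea (field names are made up for illustration):

        // The conversation doubles as the trace: tool calls, latency, and user
        // feedback are recorded as part of the turns themselves.
        type Turn = {
          role: "user" | "assistant" | "tool";
          content: string;
          toolName?: string;        // set when the agent invoked a tool
          latencyMs?: number;
          feedback?: "up" | "down"; // user feedback captured in place
        };

        type Conversation = { id: string; turns: Turn[] };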

    Now, there are a lot more things going on in the background but the overall architecture is simple and does not require any additional monitoring infrastructure.

    That's my $0.02 after building a company in the space of conversational AI where we do that sort of thing all the time.

  • bfung 5 hours ago