I spent a couple of years building a high-performance, expressive library for structured outputs in LLMs. Our library is used by OpenAI for structured outputs on the hosted API. Happy to answer questions on how this works:
User friendly library that connects to lots of OSS model serving backends: https://github.com/guidance-ai/guidance/
Core Rust library written for high performance mask computation (written mostly by my collaborator @mmoskal): http://github.com/guidance-ai/llguidance
> Happy to answer questions on how this works
Well, thank you for that; from a quick skim of Guidance, it looks like it is used when interfacing with the model directly - i.e. if I want to use Guidance, I can't simply send input to my local Ollama instance; I have to stand up a small Python program that loads the model, accepts input from the user, pushes the user input tokens into the model, and, for each output token, rejects it if it fails some criteria.
Is this correct? If so, it means that the current way LLMs are interfaced with (via stdin/stdout or an HTTP endpoint) can't be used with something like Guidance, correct?
I'm stupid, so my question will be too.
I'm trying to write a really large book. I have a lot of material that I'm using RAG to help manage. I put into my prompts the top RAG cosine scores with some summaries of characters and previous chapters and scene sketches. I get scenes out and then work them over. LLMs are really helpful for my disability and have allowed me to make any progress at all on this.
Is your thing something I should look into for helping keep track of my material? I'm using Excel sheets and crappy Python code right now.
I'm pretty sure your stuff is some super technical backend thingy, but I figured I'd shoot my shot here. Thanks for any and all info, I appreciate it.
The LLGuidance paper is highly recommended reading for everyone interested in this! https://guidance-ai.github.io/llguidance/llg-go-brrr
TL;DR instead of just getting a token and seeing if it would be accepted by the parser, you can actually zero-out probabilities for all invalid tokens, and do the computation for this in parallel at effectively zero cost:
> Here, compute_mask() can run on the CPU during the time it would be normally just waiting for the GPU to finish. The line prob[~mask] = 0.0 would normally be fused into the softmax kernel in the last stage of the LLM, with negligible overhead. Therefore, as long as the compute_mask() function completes faster than the LLM forward pass and parser.consume() is negligible (typically follows from compute_mask() speed), the constrained generation will be as fast as the unconstrained one.
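For illustration, here is a minimal sketch of that decode loop in PyTorch-style code. The parser object and its compute_mask()/consume() methods follow the naming in the quoted passage rather than any specific library's public API, and the model is assumed to be any callable returning logits.

    import torch

    def constrained_decode(model, parser, tokens, eos_token_id, max_new_tokens=256):
        for _ in range(max_new_tokens):
            # Launch the forward pass (CUDA kernels run asynchronously, so the
            # CPU is free to work while the GPU is busy).
            logits = model(torch.tensor([tokens], device="cuda"))[0, -1]
            # Meanwhile, on the CPU: which tokens does the grammar allow next?
            mask = torch.as_tensor(parser.compute_mask(), device=logits.device)
            probs = torch.softmax(logits, dim=-1)
            probs[~mask] = 0.0                      # zero out grammar-invalid tokens
            next_token = torch.multinomial(probs, num_samples=1).item()
            parser.consume(next_token)              # advance the parser state
            tokens.append(next_token)
            if next_token == eos_token_id:
                break
        return tokens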
I'm curious - have there been any research/conversations about pushing masking even earlier in the pipeline? In theory, there's a fair amount of compute that goes into computing the probability of tokens that will end up being masked away anyway.
I'm also working on a library to steer the sampling step of LLM's but more for steganographic / arbitrary data encoding purposes.
Should work with any llama.cpp compatible model: https://github.com/sutt/innocuous
"The constraint system offered by Guidance is extremely powerful. It can ensure that the output conforms to any context free grammar (so long as the backend LLM has full support for Guidance). More on this below." --from https://github.com/guidance-ai/guidance/
I didn't find any more on that comment below. Is there a list of supported LLMs?
Good point re: documentation...
We have support for Hugging Face Transformers, llama.cpp, vLLM, SGLang, and TensorRT-LLM, along with some smaller providers (e.g. mistral.rs). Using any of these libraries as an inference host means you can use an OSS model with the guidance backend for full support. Most open source models will run on at least one of these backends (with vLLM probably being the most popular hosted solution, and transformers/llama.cpp being the most popular local model solutions).
We're also the backend used by OpenAI/Azure OpenAI for structured outputs on the closed source model side.
How does this compare to Pydantic AI?
I've yet to see a thorough comparison of design, performance, and reliability between these options (along with Outlines, etc.)
We did quite a thorough benchmarking of various structured decoding providers in one of our papers: https://arxiv.org/abs/2501.10868v3 , measuring structured outputs providers on performance, constraint flexibility, downstream task accuracy, etc.
Happy to chat more about the benchmark. Note that these results are a bit out of date, though; I'm sure many of the providers we tested have made improvements (and some have switched wholesale to using llguidance as a backend).
I think @dcreater was asking how these various structured decoding providers compare with how Pydantic AI handles structured output, i.e. via tool calling: forcing the LLM to use a tool whose arguments are a JSON schema, so you read the tool call arguments and get a structured output.
Pydantic is a _validation_ library; it does not do any kind of constrained generation by itself.
I'm referring to Pydantic AI: https://ai.pydantic.dev/
Guidance is genuinely impressive for anyone wrangling LLM output. The ability to map grammar constraints so efficiently at inference solves so many subtle issues—tokenization headaches being just one. Curious if you've benchmarked adoption for JSON vs. custom grammars among production teams? Anecdotally, JSON's become the baseline, but custom grammars unlock way more nuanced applications.
Thanks :)
Great question re: adoption...it's definitely dominated by JSON. Most API providers have standardized on JSON outputs, so application teams have started building shims that map other formats to JSON and back. Similarly, with models heavily being post-trained to generate "good" JSON, I think there's a better model-constraint alignment story with JSON than most arbitrary grammars.
That said, internally, we experiment quite a lot with custom grammars all across the stack. It's more complicated to write a grammar than a JSON schema (though LMs are very good at grammar writing now) and more error-prone to debug, but it can help significantly in certain cases (e.g. having models write custom DSLs not commonly found on the internet, at various parts of a model training pipeline, etc.). I'm hoping that with the right tooling around it, the broader community will start nudging beyond JSON.
To that end, the Python guidance library is really an attempt to make writing grammars more friendly to a Python programmer. More to be done here of course!
I've been curious about grammar support for non-JSON applications. (i.e., I have some use cases where XML is more natural and easier to parse but Pydantic seems to assume you should only work with JSON.) Would guidance be able to handle this use case?
In general I find that matching the most natural format for a document outperforms waiting for the big model trainers to convince the model that the format you want is a valid structure, so anything that lets me interweave structured and unstructured generation is very interesting to me right now.
guidance can handle many context-free grammars. We use an Earley parser under the hood (https://en.wikipedia.org/wiki/Earley_parser) which gives us significant flexibility boosts over alternative approaches that use weaker parsers (and went through lots of effort to make Earley parsing fast enough to not slow down LM inference). However, XML is not perfectly context-free, though with some basic assumptions you can make it CF.
The annoying bit with grammars is that they are unfortunately a bit complex to write properly. Fortunately, language models are getting better at this, so to get an XML grammar you can hopefully get most of the way there with just a GPT-5 prompt. I suppose it would be a good idea to have a better pre-built set of popular grammars (like a modified XML) in guidance so that we cut this headache out for users...!
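To make that concrete, here is a rough sketch of such a grammar for a tiny XML-like subset, written with the Lark library (which, like llguidance, supports Earley parsing) rather than guidance's own grammar syntax. Note that a pure CFG cannot force the closing tag name to match the opening one, which is exactly the non-context-free wrinkle mentioned above.

    from lark import Lark

    # Illustrative only: a context-free grammar for an XML-like subset.
    xml_subset = Lark(r"""
        element: OPEN content CLOSE
        content: (element | TEXT)*
        OPEN:  /<[A-Za-z][A-Za-z0-9_-]*>/
        CLOSE: /<\/[A-Za-z][A-Za-z0-9_-]*>/
        TEXT:  /[^<>]+/
    """, start="element", parser="earley")

    tree = xml_subset.parse("<note><to>Ada</to><body>hello world</body></note>")
    print(tree.pretty())
    # A CFG can't require CLOSE to repeat the name used in OPEN; that matching
    # (plus attributes, comments, CDATA, ...) is what makes full XML non-CF.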
I'm really just looking for a subset of XML so that's probably sufficient.
For me, the advantage that Pydantic AI has right now is that it's easy to do ingestion/validation of the generated text, since I've already got the typing information in place. If I had similar ways to create new specialized grammars on the fly (e.g., I want XML-ish tags with these fields, but also allow for arbitrary additional fields...) that would significantly sway my implementation decisions.
This is off-tangent but I find it a bit odd that the blog uses a URL fragment to load different articles when it's usually used to navigate within a page.
A consequence of this seems to be that clicking the link to a different article leaves you at the bottom of the page even though the article itself has changed.
This seems to be using JS to fetch the markdown and then render it but I do feel that it may be better off to simply pre-convert the markdown as part of the deployment process and serve the static page.
This is a great writeup! There was a period where reliable structured output was a significant differentiator and was the 'secret sauce' behind some companies' success. An NL->SQL company I am familiar with comes to mind. Nice to see this both public and supported by a growing ecosystem of libraries.
One statement that surprised me was that the author thinks "models over time will just be able to output JSON perfectly without the need for constraining over time."
I'm not sure how this conclusion was reached. "Perfectly" is a bar that probabilistic sampling cannot meet.
Thank you! Maybe not "perfect" but near-perfect is something we can expect. Models like Osmosis-Structure, which just structures data, inspired some of that thinking (https://ollama.com/Osmosis/Osmosis-Structure-0.6B). Historically, JSON generation has been a latent capability of a model rather than a trained one, but that seems to be changing. gpt-oss was particularly trained for this type of behavior, and so the token probabilities are heavily skewed to conform to JSON. Will be interesting to see the next batch of models!
You're spot on about the "perfect" JSON bar being unreachable for now. The only consistently reliable method I've seen in the wild is some form of constrained decoding or grammar enforcement—bit brittle, but practical. Sampling will always be fuzzy unless the architecture fundamentally shifts. Anyone claiming zero-validity issues is probably glossing over a ton of downstream QA work.
We’ve had a lot of success implementing schema-aligned parsing in BAML, a DSL that we’ve built to simplify this problem.
We actually don't like constrained generation as an approach - among other issues it limits your ability to use reasoning - and instead the technique we're using is algorithm-driven, error-tolerant output parsing.
https://boundaryml.com/
Love your work, thanks! The 12-factor agent implementation uses your tools too.
Google's Gemini API is a bit odd with structured outputs. If you specify an application/json response MIME type, it will reliably respond with consistent JSON output without any prompt engineering shenanigans. For my workflows, this setting plus providing a JSON Schema in the system prompt works even with complex schemas.
The Gemini API has a canonical implementation of structured outputs where you can instead pass the JSON schema as a separate parameter to control the grammar more closely. However, this setting will reorder the JSON schema fields to be alphabetical beforehand, which is especially undesirable behavior, as the order of JSON fields in a schema is often deliberately chosen to control generation.
I was burned by this for a while because I assumed structured output ordering would be preserved.
You can specify ordering in the Gemini API with propertyOrdering:
"propertyOrdering": ["recipeName", "ingredients"]
JSON is still not available when you enable Grounding with Search.
The Gemini API has a propertyOrdering field for that.
that only works for the outer level, not for any nested fields
That was a great read, thank you.
I've a related observation. In my experience, the amount of hallucinated URLs with structured output (think of a field `url` or `link`) is pretty high - especially compared to the alternative approach, where you let the LLM generate text and then use a second LLM to convert the text into the desired structured format.
With structured output, it's like the LLM is forced to answer in a very specific way. So if there is no URL for the given field, it makes up the URL.
Here's a related quote from the article:
> Structured outputs builds on top of sampling by constraining the model's output to a specific format.
What I've found is that it is very important to make structured outputs as easy for the LLM as possible. This means making your schemas LLM-friendly instead of programmer-friendly.
E.g. if the LLM hallucinates non-existing URLs, you may add a boolean "contains_url" field to your entity's JSON schema, placing it before the URL field itself. This way, the URL extraction is split into two simpler steps, checking if the URL is there and actually extracting it. If the URL is missing, the `"contains_url": false` field in the context will strongly urge the LLM to output an empty string there.
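As a sketch of that schema shape (expressed here as a Pydantic model since Pydantic comes up elsewhere in the thread; the field names are just for illustration):

    from pydantic import BaseModel

    class ExtractedEntity(BaseModel):
        name: str
        # The boolean comes *before* the URL, so the model first commits to
        # whether a URL exists at all...
        contains_url: bool
        # ...and only then fills it in; empty string when contains_url is false.
        url: str = ""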
This also comes up with quantities a lot. Imagine you're trying to sort job adverts by salary ranges, which you extract via LLM. These may be expressed as monthly instead of annual (common in some countries), in different currencies, pre/post tax, etc.
Instead of having an `annual_pretax_salary_usd` field, which is what you actually want, but which the LLM is extremely ill-equipped to generate, have a detailed schema like `type: monthly|yearly, currency:str, low:float, high:float, tax: pre_tax|post_tax`.
That schema is much easier for an LLM to generate, and you can then convert it to a single number via straight code.
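A rough sketch of that split, again as Pydantic models plus a bit of plain code for the conversion (the FX rates below are placeholders, and tax handling is deliberately glossed over):

    from enum import Enum
    from pydantic import BaseModel

    class Period(str, Enum):
        monthly = "monthly"
        yearly = "yearly"

    class Tax(str, Enum):
        pre_tax = "pre_tax"
        post_tax = "post_tax"

    class SalaryRange(BaseModel):
        period: Period      # LLM-friendly: extract what the ad literally says
        currency: str
        low: float
        high: float
        tax: Tax

    USD_PER = {"USD": 1.0, "EUR": 1.08, "GBP": 1.27}   # placeholder rates

    def annual_pretax_usd(s: SalaryRange) -> tuple[float, float]:
        # Straight code turns the faithful extraction into the number you want.
        months = 12.0 if s.period == Period.monthly else 1.0
        fx = USD_PER.get(s.currency.upper(), 1.0)
        return s.low * months * fx, s.high * months * fx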
Awesome insight, thanks for this!
That's definitely possible.
As you know, (most current) LLMs build text autoregressively. This allows them to generate text with _exactly_ the same distribution as the training data.
When you constrain LLM output at each token, that gives a completely different distribution from letting the LLM generate a full output and then doing something with that (trying again, returning an error, post-processing, etc).
E.g.: Suppose the LLM has a training set of (aa, ab, ab, ba), noting that "ab" appears twice. Suppose your valid grammar is the set (ab, ba). Then your output distributions are:
Baseline: {invalid: 25%, ab: 50%, ba: 25%}
Constrained: {invalid: 0%, ab: 75%, ba: 25%}
Note that _all_ the previously invalid outputs were dumped into the "ab" bucket, skewing the ratio between "ab" and "ba". That skew may or may not be desirable, but assuming the training process was any good it's likely undesirable.
You've observed it in URLs, but I see it in JSON output as well. LLMs like to truncate long strings from time to time, but when they do they're more likely to provide invalid JSON (adding an ellipsis at the end of the fragment and doing nothing else). If that truncation starts to happen in a constrained environment, a period is a valid character in a long string, and eventually the grammar constraint will force a closing quote to appear. The result is still garbage, but instead of a detectable parse failure you have an undetectable corrupt field.
Why do you think the constrained percentages are 0/75/25 and not e.g. 0/66/33? (i.e. the same relative likelihood for the valid outputs)
The constraint algorithm looks something like:
1. Choose the first token. If well-trained you have a 75% chance of choosing "a" and a 25% chance of choosing "b". Both are valid for that grammar.
2. Choose the second token. Regardless of your first token, there is exactly one choice of grammar-adhering completion. You're now at a 75% chance of "ab" and a 25% chance of "ba" (mirroring the first-token chance).
For a toy example like this you obviously wouldn't use an LLM, but techniques like you're suggesting don't work because it's infeasible to enumerate all the valid outputs and re-weight and because greedy and semi-greedy strategies aren't anywhere near sufficient to side-step the issue. At the point in time you select the "a" token at a 75% probability it's game-over unless you re-run the LLM. You can't beam search either (doing so just changes which token you'll mis-predict, and even then only for very local grammar mistakes).
Looking at my JSON example from earlier, a beam search to avoid that re-weighting requires a depth of at least 4 (going as far as the ellipsis plus the stop token), and it won't suffice to just consider locally high-weight paths (you can probably hack something together for that one issue in particular which searches high weight paths and backtracks if they're found to be low-weight due to grammar mismatches, but that has its own bias unless you fan out to all 1e19 length-4 paths, and it won't solve the general problem regardless).
Phrased slightly differently, you don't have a compute_future_grammar_adhering_weight(token) function which is tractably computable, so you can't actually redistribute the 8.3% probability from the "a" branch to the "b" branch.
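A tiny numeric check of the toy example, for anyone who wants to see the 75/25 split fall out of the first-token marginals:

    from collections import Counter

    train = ["aa", "ab", "ab", "ba"]
    p_first = {t: c / len(train) for t, c in Counter(s[0] for s in train).items()}
    # p_first == {'a': 0.75, 'b': 0.25}

    # Under the grammar {ab, ba} the second token is forced by the first, so the
    # constrained whole-string distribution equals the first-token distribution:
    constrained = {"ab": p_first["a"], "ba": p_first["b"]}
    print(constrained)   # {'ab': 0.75, 'ba': 0.25} vs. unconstrained 50% / 25%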
Oh now I understand. I thought your ab and ba were single tokens (even though that doesn't make sense in context). Once you point out they're separate tokens, I follow you. Thank you!
Edit: that's a great example
Edit 2: even more fun: training data is [ab, ab, ba, bb, bb, bb]. Then constrained sampling flips your likelihood from 1:2 to 2:1
> let the llm generate text and then use a second llm to convert the text into the desired structured format
this sounds similar to what they discussed in the article with regards to "thinking" models, i.e. let them generate their <think>blah blah</think> preamble first before starting to constrain the output to structured format
Have you tried techniques that don't require modifying the LLM and the sampling strategy for structured outputs? For example, schema-aligned parsing, where you build error tolerance into the parser instead of coercing to a grammar.
https://boundaryml.com/blog/schema-aligned-parsing
I love BAML. Surprised it's not more popular. I can get structured outputs from any model, even ones that don't support JSON schema outputs, etc.
It looks really slick; for us, the reason we haven't adopted it yet is that it brings more tooling and configuration that overlaps with our existing system for prompt templates, schema definitions, etc. In the component where we couldn't rely on OpenAI structured outputs, we experimented with TOML-formatted output, and that ended up being reliable enough to solve the problem across many models without any new dependencies. I do think we'll revisit at some point, as Boundary also provides incremental parsing of streaming outputs and may allow some cost optimization that is not easy right now.
I've found that writing a very simple DSL that resembles human speech and an interpreter that can output JSON is very effective.
Human
4x1200 with 30 second rest
AI DSL output
Repeat 4 times:
- Run 1200 meters
- Rest 30 seconds
I hand wrote a recursive descent parser in Python to process the DSL (a toy sketch of the idea is below). Human speech to DSL is pretty effective with a simple prompt and some examples.
I created a tool that can program Garmin & Apple Watches for interval training based on what I wrote above.
https://speedystride.com
Looking for beta testers- please give it a try :)
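For flavor, here is a toy parser sketch for a DSL of that shape. The grammar details are guessed, not the commenter's actual implementation, and a real recursive-descent version would handle nesting; this just shows the idea of turning the DSL into JSON-ready data.

    import re

    def parse_workout(text):
        """Parse lines like 'Repeat 4 times:' followed by '- Run 1200 meters'."""
        lines = [l.strip() for l in text.strip().splitlines() if l.strip()]
        i, intervals = 0, []
        while i < len(lines):
            m = re.match(r"Repeat (\d+) times:$", lines[i])
            if m:
                repeat, i = int(m.group(1)), i + 1
            elif lines[i].startswith("- "):
                repeat = 1
            else:
                raise ValueError(f"unrecognized line: {lines[i]!r}")
            steps = []
            while i < len(lines) and lines[i].startswith("- "):
                action, amount, unit = lines[i][2:].split()   # "Run 1200 meters"
                steps.append({"action": action.lower(), "amount": int(amount), "unit": unit})
                i += 1
            intervals.append({"repeat": repeat, "steps": steps})
        return {"intervals": intervals}

    print(parse_workout("Repeat 4 times:\n- Run 1200 meters\n- Rest 30 seconds"))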
When doing structured sampling, why is the token sampled, checked against the grammar, and then resampled by applying the mask if it's wrong?
Why wouldn't we apply the mask immediately for the first sampling? Is this an optimization somehow, is masking expensive?
If you can screen tokens against your grammar fast enough, you can build a bitmask over the entire token vocabulary and apply it right before sampling. As vocabulary sizes grow, this gets more complex to do in real time, but we (and other libraries) have found several optimizations to do this extremely quickly (e.g. for guidance, we detail some optimizations here: https://github.com/guidance-ai/llguidance/blob/main/docs/opt...).
Other libraries work by essentially pre-computing all the masks for all possible generations, but of course you're restricted to working with simple grammars in this case (like a subset of regular expressions)
Implementation preference.
> is masking expensive?
It's not expensive per se; it's a single element-wise multiplication of the output vector.
The real "expense" is that you need to prepare masks for every element of your grammar, since they are expensive to recompute on the fly; LLM tokens do not cleanly map onto elements of your grammar. (Consider JSON: LLM tokens often combine various special characters such as curly braces, colons, and quotes.)
This isn't that hard to compute, it's just more work to implement.
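As a rough sketch of that token-to-grammar mapping problem (a real implementation such as llguidance is far more sophisticated; the tokenizer name and the three "grammar elements" below are just illustrative):

    import re
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    # Text of every vocab entry, id -> string.
    vocab = {i: tok.convert_tokens_to_string([t]) for t, i in tok.get_vocab().items()}

    # For each "grammar element", which whole tokens could legally appear next?
    element_patterns = {
        "object_start": re.compile(r"^\s*\{"),           # e.g. '{' or ' {'
        "string_body":  re.compile(r'^[^"\\\x00-\x1f]+$'),
        "integer":      re.compile(r"^-?\d+$"),
    }
    masks = {name: {tid for tid, text in vocab.items() if pat.match(text)}
             for name, pat in element_patterns.items()}
    print({name: len(ids) for name, ids in masks.items()})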
Good question—some frameworks do apply the mask immediately, others defer for performance or implementation simplicity. Mask precomputation can get tricky with large vocabularies, especially if grammar elements span multiple tokens. Immediate masking is usually preferred, but optimizations kick in when you're juggling complicated grammars or working against throughput bottlenecks.
I was hoping to find some insights about why performance drops when using actual structured outputs. It's been a known problem. For example this paper "Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models" says:
> Surprisingly, we observe a significant decline in LLMs’ reasoning abilities under format restrictions. Furthermore, we find that stricter format constraints generally lead to greater performance degradation in reasoning tasks.
https://arxiv.org/abs/2408.02442v1
That paper had some serious methodological issues, and the results have been shown to be misunderstood/incorrect in the majority of cases. In fact, in many cases structured outputs have been shown to improve the quality of results from an LLM (at least in terms of evaluation performance). The team behind the Outlines library released a response that covers the issues in detail and provides more information about structured outputs [0].
0. https://blog.dottxt.ai/say-what-you-mean.html
Thanks for posting! Didn't expect this to get picked up – it was a bit of a draft haha. Happy to answer questions around structured outputs :)
This constrains the output of the LLM to some grammar.
However, why not use a grammar that does not have invalid sentences, and from there convert to any grammar that you want?
What if the converted version is not in the wanted syntax?
Constrained generation guarantees syntax. It does not guarantee semantic correctness tho. Imagine you want a json object with "hp" and "damage". If you use a grammar, the model will be forced to output a json object with those two values. But it's not guaranteed to get sensible values.
With a 2nd pass you basically "condition" it on the text right above, hoping to get better semantic understanding.
I'm pretty sure the grammar is generated from the JSON schema; it doesn't just constrain JSON syntax, it constrains on the schema (including enums and such). The schema is also given to the model (at least in OpenAI's case), and you can put instructions in the JSON schema as well that will be taken into account.
Perhaps I worded that poorly. What I mean by semantic correctness is that the model could output nonsensical values for some things. Say in a game, "normal" health is ~100hp and the model creates a wizard with 50hp but then a mouse with 10000hp. So you're guaranteed to get a parsable json object (syntactically correct) but what the values are in that json is not guaranteed to make sense in the given context.
You can specify `minimum` and `maximum` properties for these fields, and the schema (bounds included) gets converted to a BNF-like representation.
For anyone curious, here is an interactive write-up about this: http://michaelgiba.com/grammar-based/index.html
I find it does pretty well given a reasonable prompt and (especially) well-named keys/JSON structure. So if you had boss.mouse.hp you would get higher HP than random_enemies.mouse.hp, or better: enemies.level_1.mouse.hp.
Hmm, so if structured output affects the quality of the response maybe it's better to convert the output to a structured format as a post-processing step?
It's a tradeoff between getting "good enough" performance w/ guided/constrained generation and using 2x calls to do the same task. Sometimes it works, sometimes it's better to have a separate model. One good case of 2 calls is the "code merging" thing, where you "chat" with a model, giving it a source file + some instruction, and if it replies with something like ... //unchanged code here ... some new code ... //the rest stays the same, then you can use a code merging model to apply the changes. But that's been made somewhat obsolete by the new "agentic" capabilities, where models learn how to diff files directly.
Depending on the task you can often get it in about one request on average. Ask for the output in Markdown with reasoning up front and the structured output in a code block at the end, then extract and parse that bit in code.
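A minimal sketch of that extraction step (assuming the model was asked to put the final answer in a ```json fenced block at the end of its Markdown):

    import json
    import re

    def extract_final_json(markdown: str):
        # Grab every ```json ... ``` block and parse the last one.
        blocks = re.findall(r"```json\s*(.*?)```", markdown, flags=re.DOTALL)
        if not blocks:
            raise ValueError("no fenced json block in model output")
        return json.loads(blocks[-1])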
Haiku is my favorite model for the second pass. It's small, cheap, and usually gets it right. If I see hallucinations, they are mostly from the base model in the first pass.
After endlessly tweaking the SQL generators[1] that I am working on, I would recommend setting a "reasoning" output string to activate step by step thinking and better responses. Even better if you can add output "reasoning strings" more relevant to the specific task you are trying to solve.
[1]: https://app.sqlai.ai
Depends on your use case. Post-processing can save headaches when soft constraints are fine or you want max flexibility, but you risk subtle errors slipping by. For API responses or anything that gets parsed downstream, I still trust grammar-constrained generation more—it just surfaces problems earlier.
If the current position in the structure only has one possibility (like a comma, bracket, etc.) do you just force that as the next token and continue?
We do enable forcing these sequences of tokens in guidance, and find that it significantly speeds up structured generation. There are tricky alignment issues to make sure you pick the right sequence of tokens, but you can often proxy this well by using the model's native tokenizer. Some details here in an old blog: https://guidance.readthedocs.io/en/latest/example_notebooks/...
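Roughly, the fast path looks something like the sketch below. The parser method names are illustrative, not guidance's actual interface; real implementations force whole byte strings and then pick a canonical tokenization, which is where the alignment issues mentioned above come in.

    def next_token(parser, sample_with_mask):
        allowed = parser.allowed_tokens()        # token ids legal in this state
        if len(allowed) == 1:                    # grammar forces a single token:
            forced = next(iter(allowed))         # emit it with no forward pass
            parser.consume(forced)
            return forced
        token = sample_with_mask(allowed)        # otherwise mask + sample normally
        parser.consume(token)
        return token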
In most cases, yes—forcing is common when the grammar dictates a single valid option. It's a fast path. Trickier cases arise if multiple tokens could satisfy the same grammar position, especially with weird tokenizations or BPE merges. Edge cases can trip token selection, but for things like brackets/commas, forced emission usually works flawlessly.
I don't think so, because multiple tokens might match. If it needs a comma as the next character, but you have tokens for `, "blah` and `, "foo` you still want to leave those on the table.
My takeaway still today is that structured output is a two part process. If you require any heavy lifting on the LLM side, introducing structured output is going to cause reduced quality.
These techniques are limited to structures that can be checked with bounded history or bounded memory (that can be checked with a grammar or FSA). What about more complex structures that don't factor easily?
Sounds like brute force to me.
Just wait till people realize that if you have agents speak in structured output rather than chatting with you, your observability and ability to finely program your agent goes through the roof.
It's still baffling to me that the various API providers don't let us upload our custom grammars. It would enable so many use cases, like HTML generation for example, at essentially no cost on their part.
There are some implementation concerns, but the real answer is that it is an ideological choice.
The AI companies believe that these kinds of grammar mistakes will be solved by improving the models. To build out tools for grammar constrained inference like this is to suggest, on some level, that GPT-N+1 won't magically solve the problem.
The deeper level is that it's not just simple grammar constraints. Constraining to JSON is a nice party trick, but it opens the door to further ideas. How about constraining to a programming language's grammar? Those are well defined, you just swap the JSON grammar file for the Java grammar file, job done.
We can go further: why not use a language server to constrain not only the grammar but also the content? What variables and functions are in scope is known, and constraining a variable reference or function call to one of their names can be done with the same technique as grammar constraints. ("Monitor-guided decoding", figured out back in 2023.)
Entire classes of hallucination problems can be eliminated this way. The marketing writes itself; "Our AI is literally incapable of making the errors humans make!"
What many AI developers, firms, and especially their leaders find grating about this is the implication. That AI is fallible and has to be constrained.
Another such inconvenience is that while these techniques improve grammar, they highlight semantic problems. The code is correct and compiles; it just does the wrong thing.
One pattern that I've seen develop (in PydanticAI and elsewhere) is to constrain the output but include an escape hatch. If an error happens, that lets it bail out and report the problem rather than be forced to proceed down a doomed path.
Most API providers (Together, Fireworks etc) don't build their own models.
You don't need a new model. The trick of the technique is that you only change how tokens are sampled: zero out the probability of every token that would be illegal under the grammar or other constraints.
All you need for that is an inference API that gives you the full output vector, which is trivial for any model you run on your own hardware.
Though Fireworks is one of the few providers that supports structured generation.
Using grammar constrained output in llama.cpp - which has been available for ages and I think is a different implementation to the one described here - does slow down generation quite a bit. I expect it has a naive implementation.
As to why providers don't give you a nice API, maybe it's hard to implement efficiently.
It's not too bad if inference is happening token by token and reverting to the CPU every time, but I understand high performance LLM inference uses speculative decoding, with a smaller model guessing multiple tokens in advance and the main model doing verification. Doing grammar constraints across multiple tokens is tougher, there's an exponential number of states that need precomputing.
So you'd need to think about putting the parser automaton onto the GPU/TPU and using it during inference without needing to stall the pipeline by going back to the CPU.
And then you start thinking about how big that automaton is going to be. How many states, pushdown stack. You're basically taking code from the API call and running it on your hardware. There's dragons here, around fair use, denial of service etc.
A guide on llama.cpp's grammars (nine hours and not a single mention of "GBNF"? HN is slipping) is here:
https://github.com/ggml-org/llama.cpp/blob/master/grammars/R...
There's also a grammar validation tool in the default llama.cpp build, which is much easier to reason about for debugging grammars than having them bounce off the server.
If your masking is fast enough, you can make it easily work with spec dec too :). We manage to keep this on CPU. Some details here: https://github.com/guidance-ai/llguidance/blob/main/docs/opt...
Fireworks does. It is frustrating that AWS/Google/Azure do not.
https://fireworks.ai/docs/structured-responses/structured-ou...
OpenAI has started to (at least for tool calls): https://platform.openai.com/docs/guides/function-calling#con...
Nice, I missed this. Thanks.
Wouldn't that have implications for inference batching, since you would have to track state and apply a different mask for each sequence in the batch? If so, I think it would directly affect utilisation and hence costs. But I could be talking out of my ass here.
When you say custom grammar, do you mean something other than a JSON schema, because they support that?
I mean, most don't? I know you can provide a pseudo-EBNF grammar to llama.cpp but, for example, none of Anthropic, Azure, Bedrock, Mistral or Gemini allow us the same.
Another commonly used stack is LangChain + Pydantic: https://unstract.com/blog/comparing-approaches-for-using-llm...
This post dives into that "black magic" layer, especially in the context of emerging thinking models and tools like Ollama or GPT-OSS. It’s a thoughtful look at why sampling, formatting, and standardization are not just implementation details, but core to the future of working with LLMs.
I don't know if you're purposely trying to be funny, but this is obnoxious, lol