This is a SaaS problem, not an LLM problem. If you have a local LLM that nobody is upgrading behind your back, it will calculate the same thing on the same inputs. Unless there is a bug somewhere, like using uninitialized memory, the floating-point calculations and the token embedding and all the rest do the same thing each time.
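For what it's worth, a minimal sketch of that reproducibility check using Hugging Face transformers with greedy decoding (the model choice is arbitrary; identical results also assume the same hardware and kernels between runs):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # arbitrary small model, just for illustration
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
model.eval()

ids = tok("What's 2+2?", return_tensors="pt").input_ids

with torch.no_grad():
    a = model.generate(ids, do_sample=False, max_new_tokens=20)  # greedy
    b = model.generate(ids, do_sample=False, max_new_tokens=20)

# Fixed weights + greedy decoding: identical token ids on both runs
# (assuming the same hardware and kernels between runs).
print(torch.equal(a, b))  # True
```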
So couldn't SaaS or cloud/API LLMs offer this as an option? A guarantee that the "same prompt" will always produce the same result.
Also, I usually interpret this "non-deterministic" a bit more broadly.
Say I have slightly different prompts: "what's 2+2?" vs. "can you please tell me what's 2 plus 2", or even "2+2=?" or "2+2". For most applications it would be useful if they all produced the same result.
The form of the question determines the form of the outcome, even if the answer is the same. Asking the same question in a different way should produce an answer that adheres to the form of the question:
2+2 is 4
2 plus 2 is 4
4=2+2
4
Having the LLM pass the input to a tool (Python) will result in deterministic output.
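For illustration, a sketch of that delegation; the step of extracting the expression from the model's tool call is assumed, but the evaluation itself is plain Python and therefore gives the same answer every time:

```python
import ast
import operator

# Whitelist of operations keeps the tool predictable and safe.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def eval_arithmetic(expr: str):
    """Deterministically evaluate a simple arithmetic expression."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval").body)

# However the user phrases the question, the extracted expression is the
# same, and so is the answer.
print(eval_arithmetic("2+2"))  # 4
```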
Doesn’t that imply that LLMs are just "if this, then that" but bigger?
Sure, why would you expect it to be different?
Well, I do, because not a day has passed since 2021 when the general popular discourse on the subject of AI has not referenced its functionality as fundamentally novel.
There are two additional aspects that are even more critical than the implementation details here:
- Typical LLM usage involves the accretion of context tokens from previous conversation turns. The likelihood that you will type prompt A twice but all of your previous context will be the same is low. You could reset the context, but accretion of context is often considered a feature of LLM interaction.
- Maybe more importantly, because the LLM abstraction is statistical, getting the correct output for e.g. "3 + 5 = ?" does not guarantee you will get the correct output for any other pair of numbers, even if all of the outputs are invariant and deterministic. So even if the individual prompt + output relationship is deterministic, the usefulness of the model output may "feel" nondeterministic between inputs, or have many of the same bad effects as nondeterminism. For the article's list of characteristics of deterministic systems, per-input determinism only solves "caching", and leaves "testing", "compliance", and "debuggability" largely unsolved.
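To make that last point concrete: per-input determinism buys you exact-match caching and nothing more. A toy sketch, where generate() is a hypothetical stand-in for a deterministic temperature-0 model call:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def generate(prompt: str) -> str:
    # Hypothetical stand-in for a deterministic temperature-0 model call.
    print(f"model called for {prompt!r}")
    return "4"

generate("what's 2+2?")  # model called
generate("what's 2+2?")  # cache hit: byte-identical input
generate("2+2=?")        # model called again: a paraphrase never hits the cache
```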
The author read the docs but never experimented, so they don't seem to have the intuition behind the theory. For example, Gemini Flash actually seems to have deterministic outputs at temp 0, despite the disclaimer in the docs. Clearly Google has no trouble making it possible. Why don't they guarantee it, then? For starters, it's inconvenient due to batching; you can see that in Gemini Pro, which is "almost" deterministic, with identical results grouped together. It's a SaaS problem: if you run a model locally, it's much easier to make it deterministic than the article presents, and definitely not "nearly impossible". It's going to cost you more, though.
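If you want to check this yourself, the experiment is cheap. A sketch, with call_model() as a hypothetical stand-in for whichever provider's API you use:

```python
def call_model(prompt: str, temperature: float = 0.0) -> str:
    """Hypothetical stand-in for your provider's completion endpoint."""
    raise NotImplementedError

def probe_determinism(prompt: str, n: int = 10) -> bool:
    # Deterministic iff n temperature-0 calls give byte-identical outputs.
    outputs = {call_model(prompt, temperature=0.0) for _ in range(n)}
    return len(outputs) == 1
```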
But largely, you don't really want determinism. Imagine you have equal logprobs for "yes" and "no": which one should go into the output? With temperature 0 and greedy sampling it's going to be the same each time, depending on unrelated factors (e.g. which token index comes first in the vocabulary), and your outputs are going to be terribly skewed relative to what the model is actually trying to tell you in the output distribution. What you're trying to solve with LLMs is inherently non-deterministic. It's either the same with humans and organizations (but you can't reset the state to measure it), or at least it depends on a myriad of little factors that are impossible to account for.
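The tie-breaking behavior is easy to demonstrate: with greedy decoding, numpy's argmax resolves an exact tie by taking the lowest index, so vocabulary order decides, not the model. The logits below are made up:

```python
import numpy as np

vocab = ["yes", "no", "maybe"]
logits = np.array([1.5, 1.5, 0.2])  # "yes" and "no" exactly tied

# Greedy decoding: argmax breaks the tie by index order, every run.
print(vocab[np.argmax(logits)])  # always "yes", purely because it comes first
```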
Besides, all current models have issues at temperature 0. Gemini in particular exhibits micro-repetitions and hallucinations (non-existent at higher temps) which it then tries to correct. Other models have other issues. This is a training-time problem, probably unsolvable at this point.
What you want is correctness, which is pretty different, because the model works with concepts, not tokens. Try asking it what 2x2 is. It might formulate the answer differently each time, but good luck making it reply with anything other than 4 at a non-schizophrenic temperature. A bit of randomness won't prevent it from being consistently correct (or consistently incorrect).
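A toy illustration of why sampling doesn't threaten a peaked answer (the logits are made up): the dominant token wins essentially every draw at reasonable temperatures, while near-ties, like phrasing choices, stay free to vary.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = np.asarray(logits) / temperature
    e = np.exp(z - z.max())
    return e / e.sum()

vocab = ["4", "5", "fish"]
logits = [10.0, 2.0, 0.0]  # made up: the model is very sure about "4"

for t in (0.7, 1.0, 1.5):
    print(f"T={t}: P('4') = {softmax(logits, t)[0]:.4f}")
# "4" dominates at every temperature shown; sampling only reshuffles the tail.
```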
Probabilistic processes are not the most appropriate way to produce deterministic results. And definitely not if the system is designed to update, grow or "learn" from inputs.
There may be something I do not understand about LLMs. But it seems it is more correct to say LLMs are chaotic - in the mathematical sense of sensitive dependence on initial conditions.
The only actual nondeterminism is deliberately injected, e.g. via the temperature parameter. Without that, it is deterministic but chaotic. This is the case both in training LLMs and in using the trained models.
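Worth adding: the injected randomness is itself pseudo-random, so fixing the seed restores determinism even with temperature sampling. A minimal numpy sketch:

```python
import numpy as np

def sample(probs, seed):
    rng = np.random.default_rng(seed)  # fixed seed: reproducible draws
    return rng.choice(len(probs), p=probs)

probs = [0.5, 0.5]  # a maximally "random" output distribution
# Same seed, same draw, every time: sampling is only nondeterministic
# if you let the seed vary.
print([sample(probs, seed=42) for _ in range(3)])  # three identical values
```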
If I missed something, someone point it out please.
You aren't understanding the properties of Determinism, and many people, even graduates of a Computer Science program, often don't have a working knowledge of this (the most competent do).
It's more correct to say that determinism occurs because the mathematical property is preserved, or closed, under its domain and the related operations. This connection becomes clear once you've taken an abstract algebra (modern algebra) course. It was a critical leap towards computers, based in the design of emergent systems.
The property can be broken quite easily by not preserving it, but then you have no way to tell the reliability of the output from randomness thereafter, and there is no concept of correctness in stochastic environments (where one token can map to more than one token, and tokens are not 'unique').
To put it plainly, Determinism is mathematical relabeling (i.e. a function test on the domain of operations that are performed).
While the constraints hold true, and the ISA and related stack maintain those constraints (i.e. are closed over those operations), you get reliable consistency. The property acts as an abstract guide rail to do work, which is how such simple combinations of circuit logic controlled by software can perform all the magical things we do and see today.
Time Invariance usually goes hand-in-hand with Determinism, and is needed for troubleshooting; that usually requires memory-less properties, though it depends on where you are on the stack. Determinism is required for any automatic layer for reliability, and that is over the entire domain of possible things that can happen. Without Determinism, you run into the halting and incompleteness problems of classical Computer Science, which have stood a good long test of time.
Error handling also generally stops working because you need to know and specify a state to match in order to handle a state, and that requires a determinable state in the first place.
A mapping of one unique input to one unique output, and a projection onto, are required for relabeling. The electronics are designed to preserve the property up to the logic layer.
The moment you have a unique item which is not actually unique, this is broken, and it's real subtle. ldd on Linux, for example, has two different but similar errors of this type that have remained unfixed (for over 10 years) because they weren't viewed as errors by the maintainers. This is to say that even long-term professional programmers (likely non-engineers) often fail to recognize these types of foundational errors.
The result is that the utility's output can't be usefully passed to any further automation, because of its non-deterministically structured output. Specifically, the null token and in-memory kernel structure tokens. Regex also requires these properties. You'll find there is at least one easily found instance of ldd on the ssh utility where you can't simply grep -ev to separate or filter material (to try and pigeonhole the output into a deterministic state), and even adding a DFA program sequentially can't reverse this; a patch must occur at the point of error.
These crop up in production automation all the time, and are usually the most costly to fix, given the expertise needed to recognize the error. If determinism isn't present, no automation further downstream can be guaranteed to work. Determinism lets you constrain or expand the scope of a system of systems to narrow down and home in on where the failure occurs.
Troubleshooting is an abstract application of testing for determinism, and you can easily tell when a problem won't have this tool available by probing inputs and outputs. In the absence of this property, you only have guess-and-check, which requires intimate knowledge of the entire system at all levels of abstraction. This is most costly in time, given that such documentation is almost never available.
As a final real-world example, consider an Excel roster of employees at a large company, where you are only given the name of a person whose account you must shut down. What do you do when two people have the same name? What can you do without further input? Nothing. If you shut down both accounts, you're fired; if you shut down the wrong account, you're fired. You have an indeterminable state.
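That indeterminable state is just a lookup on a non-unique key. A toy sketch with made-up records:

```python
roster = [
    {"id": 101, "name": "Alex Kim", "dept": "Sales"},        # made-up records
    {"id": 217, "name": "Alex Kim", "dept": "Engineering"},
]

def account_to_disable(name: str) -> int:
    matches = [r for r in roster if r["name"] == name]
    if len(matches) != 1:
        # Indeterminable state: the key does not identify a unique record,
        # so no safe automatic action exists without more input.
        raise LookupError(f"{len(matches)} records match {name!r}; need an id")
    return matches[0]["id"]

account_to_disable("Alex Kim")  # raises: ambiguous without further input
```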
The interactive layer is a lot more forgiving than the automatic layer because people can recognize when we need to get or provide more information.
Hopefully this clarifies your understanding.
I don't see anything you said that indicates the OP was incorrect in any way.
If that is the case, then you didn't read or comprehend what was actually said, and no one can tailor a response to people who can't read and comprehend.
There are important distinctions; it's beyond the scope for me to try and guess at where that failure of comprehension might be for an individual such as yourself.
Basic reading comprehension would note: Properties are not individual inputs; they apply to the whole system as a relationship between input and output. Individual inputs cannot define properties.
"Chaos" has a very rigorous definition (changes in small inputs lead to large changes in outputs).
"Injection of non-determinism" is only correct if it included a reference that determinism is built-in to all computation which is not a common understanding. Without that reference, the context improperly includes an indeterminable indirection resulting in fallacy.
The two are unrelated and independent to the context of the conversation or determinism, and so defining such understanding in those terms would result in fallacy (by improper isolation), delusion, or hallucination.
These are fundamental errors in reasoning and by extension understanding.
The correct understanding, on firm foundations, was provided. It is on the individual without knowledge to come into a conversation with the bare minimum requirements for comprehension based in rational thought and practice.
Edit: No amount of down-voting will change the truth of this, though I understand why someone would want useful knowledge to be hidden.
I downvoted you because your tone is unnecessarily harsh and rude.
You should honestly re-evaluate and re-calibrate your measure of tone, in moderation and in relation to everything else.
Terse is not harsh or rude; it's condensed, which carries a fine distinction.
Most business people and professionals speak this way, especially when it comes down to objective facts which are not in question.
The facts, and the effort towards minimizing cost for all parties in a communication, convey an overall respect; it's extra effort I didn't have to provide, which works towards a specific goal for the benefit of everyone involved in the communication.
If there is a mistake made on either party's part, it's not harsh or rude to point out the mistake in such an unambiguous format, or, where that's not possible due to a deficit, to point out why.
Elaborating in detail would be condescending; on the opposite side, personal haranguing would be a coercive imposition of cost.
You'll note I did neither, which is the socially acceptable way to handle it, and does not merit the actions that were taken.
The only generally understood acceptable middle ground between those two extremes is terse and to the point, and when you eliminate both sides and the middle ground, you classify all communication as harsh and rude, which is an absurdity.
People cannot read other people's minds, and the point of communication is to convey meaning in a way the parties involved can use it towards their own ends beneficially if they choose. That reflected appraisal is beneficial to all people involved.