For sure! Measuring parameters given data is central to statistics. It’s a way to concentrate information for practical use. Sufficient statistics are very interesting, bc once computed, they provably contain as much information as the data (lossless). Love statistics, it’s so cool!
Imagine that you have a spreadsheet that dates from the beginning of the universe to its end. It contains two columns: the date, and how many days it has been since the universe was born. That's a very big spreadsheet with lots of data in it. If you plot it, it creates a seemingly infinite diagonal line.
But it can be "abstracted" as Y=X. And that's what ML does.
I don't think it's the same thing because an abstraction is still tangible. For example, "rectangle" is an abstraction for all sorts of actual rectangular shapes you can find in practice. We have a way to define what a rectangle is and to identify one.
A neural network doesn't have any actual conceptual backing for what it is doing. It's pure math. There are no abstracted properties beyond the fact that by coincidence the weights make a curve fit certain points of data.
If there was truly a conceptual backing for these "abstractions" then multiple models trained on the same data should have very similar weights as there aren't multiple ways to define the same concepts, but I doubt that this happens in practice. Instead the weights are just randomly adjusted until they fit the points of data without any respect given to whether there is any sort of cohesion. It's just math.
That's like saying multiple programs compiled by different compilers from the same sources should have very similar binaries. You're looking in the wrong place! Similarities are to be expected in the structure of the latent space, not in model weights.
> Wouldn't that mean LLMs represent an insanely efficient form of text compression?
This is a good question worth thinking about.
The output, as defined here (I'm assuming by reading the comment thread), is a set of one value between 0 and 1 for every token the model can treat as "output". The fact that LLM tokens tend not to be words makes this somewhat difficult to work with. If there are n output tokens and the probability the model assigns to each of them is represented by a float32, then the output of the model will be one of at most (2³²)ⁿ = 2³²ⁿ values; this is an upper bound on the size of the output universe.
The input is not the training data but what you might think of as the prompt. Remember that the model answers the question "given the text xx x xxx xxxxxx x, what will the next token in that text be?" The input is the text we're asking about, here xx x xxx xxxxxx x.
The input universe is defined by what can fit in the model's context window. If it's represented in terms of the same tokens that are used as representations of output, then it is bounded above by n+1 (the same n we used to bound the size of the output universe) to the power of "the length of the context window".
Let's assume there are maybe somewhere between 10,000 and 100,000 tokens, and the context window is 32768 (2¹⁵) tokens long.
Say there are 16384 = 2^14 tokens. Then our bound on the input universe is roughly (2^14)^(2^15). And our bound on the output universe is roughly 2^[(2^5)(2^14)] = 2^(2^19).
(2^14)^(2^15) = 2^(14·2^15) < 2^(16·2^15) = 2^(2^19), and 2^(2^19) was our approximate number of possible output values, so there are more potential output values than input values and the output can represent the input losslessly.
For a bigger vocabulary with 2^17 (=131,072) tokens, this conclusion won't change. The output universe is estimated at (2^(2^5))^(2^17) = 2^(2^22); the input universe is (2^17)^(2^15) = 2^(17·2^15) < 2^(32·2^15) = 2^(2^20). This is a huge gap; we can see that in this model, more vocabulary tokens blow up the potential output much faster than they blow up the potential input.
What if we only measured probability estimates in float16s?
Then, for the small 2^14 vocabulary, we'd have roughly (2^16)^(2^14) = 2^(2^18) possible outputs, and our estimate of the input universe would remain unchanged, "less than 2^(2^19)", because the fineness of probability assignment is a concern exclusive to the output. (The input has its own exclusive concern, the length of the context window.) For this small vocabulary, we're not sure whether every possible input can have a unique output. For the larger one, we'll be sure again - the estimate for output will be a (reduced!) 2^(2^21) possible values, but the estimate for input will be an unchanged 2^(2^20) possible values, and once again each input can definitely be represented by a unique output.
So the claim looks plausible on pure information-theory grounds. On the other hand, I've appealed to some assumptions that I'm not sure make sense in general.
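If you want to play with those bounds, here is a quick sanity check of the arithmetic above in Python (same illustrative vocabulary sizes, context length, and float widths as in this comment; none of these numbers come from the paper):

    import math

    CONTEXT_LEN = 2**15          # 32768-token context window, as assumed above

    def log2_outputs(vocab_size, float_bits):
        # One probability per vocabulary token, each a float_bits-bit value:
        # |outputs| <= (2^float_bits)^vocab_size, so log2 is float_bits * vocab_size.
        return float_bits * vocab_size

    def log2_inputs(vocab_size, context_len=CONTEXT_LEN):
        # Each position holds one of vocab_size tokens:
        # |inputs| <= vocab_size^context_len, so log2 is context_len * log2(vocab_size).
        return context_len * math.log2(vocab_size)

    for vocab, bits in [(2**14, 32), (2**17, 32), (2**14, 16), (2**17, 16)]:
        out_b, in_b = log2_outputs(vocab, bits), log2_inputs(vocab)
        print(f"vocab=2^{int(math.log2(vocab))}, float{bits}: "
              f"|outputs| ~ 2^(2^{math.log2(out_b):.2f}), "
              f"|inputs| ~ 2^(2^{math.log2(in_b):.2f}), "
              f"outputs can cover inputs: {out_b >= in_b}")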
> That's why an LLM (I tested this on Grok) can give you a summary of chapter 18 of Mary Shelley's Frankenstein, but cannot reproduce a paragraph from the same text verbatim.
I have some issues with the substance of this, but more to the point it characterizes the problem incorrectly. Frankenstein is part of the training data, not part of the input.
> That's why an LLM (I tested this on Grok) can give you a summary of chapter 18 of Mary Shelley's Frankenstein, but cannot reproduce a paragraph from the same text verbatim.
I have not known an LLM to be able to summarise a book found in its training data, unless it had many summaries to plagiarise (in which case, actually having the book is unnecessary). I have no reason to believe the training process should result in "abstracting the data into more efficient forms". "Throwing away most of the training data" is an uncharitable interpretation (what they're doing is more sophisticated than that) but, I believe, a correct one.
I don't like the title of this paper, since most people in this space probably think of language models not as producing a distribution (wrt which they are indeed invertible, which is what the paper claims) but as producing tokens (wrt which they are not invertible [0]).
Also the author contribution statement made me laugh.
But I bet you could reconstruct a plausible set of distributions by just rerunning the autoregression on a given text with the same model. You won't invert the exact prompt but it could give you a useful approximation.
Still, it is technically correct. The model produces a next-token likelihood distribution, then you apply a sampling strategy to produce a sequence of tokens.
Depends on your definition of the model. Most people would be pretty upset with the usual LLM providers if they drastically changed the sampling strategy for the worse and claimed to not have changed the model at all.
Sure, but they went slightly overboard with that headline and they knew it. But oh well, they have a lot of eyes and discussion on their paper so it's a success.
I feel like, if the feedback to your paper is "this is over-done / they claim more than they prove / it's kinda hype-ish", you're going to get fewer references in future papers.
That would seem to be counter to the "impact" goal for research.
Fair enough, that might be more my personal opinion instead of sound advice for successful research. Also I understand that you have a very limited amount of time to get your research noticed in this topic. Who knows if it's relevant two years down the line.
LLM providers are in the stone age with sampling today, and it's on purpose: better sampling algorithms make the diversity of synthetic generated data too high, making your model especially vulnerable to distillation attacks.
This is why we use top_p/top_k on the big 3 closed-source models despite min_p and far better LLM sampling algorithms existing since 2023 (or in TFS's case, since 2019).
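For anyone who hasn't seen these truncation schemes, here is a rough sketch of top-p versus min-p filtering of a next-token distribution. This is only an illustration of the idea, not any provider's actual implementation:

    import numpy as np

    def top_p_filter(probs, p=0.9):
        # Keep the smallest set of tokens whose cumulative probability reaches p.
        order = np.argsort(probs)[::-1]
        keep = np.searchsorted(np.cumsum(probs[order]), p) + 1
        out = np.zeros_like(probs)
        out[order[:keep]] = probs[order[:keep]]
        return out / out.sum()

    def min_p_filter(probs, min_p=0.1):
        # Keep tokens whose probability is at least min_p times the top probability.
        out = np.where(probs >= min_p * probs.max(), probs, 0.0)
        return out / out.sum()

    probs = np.array([0.50, 0.30, 0.15, 0.04, 0.01])
    print(top_p_filter(probs))   # fixed cumulative-mass cutoff
    print(min_p_filter(probs))   # cutoff scales with the model's confidence

Note how the min_p cutoff moves with the top probability, while top_p always keeps a fixed cumulative mass regardless of how confident the model is.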
So, scientists came up with a very specific term for a very specific function, its meaning got twisted by commercial actors and the general public, and it's now the scientists' fault if they keep using it in the original, very specific sense?
I agree it is technically correct, but I still think it is the research paper equivalent of clickbait (and considering enough people misunderstood this for them to issue a semi-retraction that seems reasonable)
I disagree. Within the research community (which is the target of the paper), that title means something very precise and not at all clickbaity. It's irrelevant that the rest of the Internet has an inaccurate notion of "model" and other very specific terms.
It has a huge implication for privacy. There is some "mental model" that embedding vectors are like a hash - so you can store them in a database, even though you would not store plain text.
It is an incorrect assumption - a good embedding stores it ALL: not just the general gist, but dates, names, passwords.
There is an easy fix to that - a random rotation; preserves all distances.
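To make the random-rotation point concrete, here is a minimal numpy sketch (hypothetical embeddings; presumably the rotation matrix Q is the secret you keep out of the database). It only illustrates that distances survive the rotation, not that this is a sufficient defense:

    import numpy as np

    rng = np.random.default_rng(0)
    d = 768                                  # e.g. a GPT-2-sized embedding width

    # Random orthogonal matrix: QR-decompose a Gaussian matrix. Q is the secret.
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

    emb = rng.standard_normal((100, d))      # stand-in for embeddings you'd store
    stored = emb @ Q                         # store only the rotated vectors

    # Dot products and pairwise distances are unchanged, so similarity search
    # still works on the rotated vectors.
    print(np.allclose(emb @ emb.T, stored @ stored.T))                      # True
    print(np.isclose(np.linalg.norm(emb[0] - emb[1]),
                     np.linalg.norm(stored[0] - stored[1])))                # True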
This mental model is also in direct contradiction to the whole purpose of the embedding, which is that the embedding describes the original text in a more interpretable form. If a piece of content in the original can be used for search, comparison etc., pretty much by definition it has to be stored in the embedding.
Similarly, this result can be rephrased as "Language Models process text." If the LLM wasn't invertible with regards to a piece of input text, it couldn't attend to this text either.
> There is an easy fix to that - a random rotation; preserves all distances.
Is that like homomorphic encryption, in a sense, where you can calculate the encryption of a function of the plaintext without ever seeing the input or the computed function of the plaintext?
- Different prompts always map to different embeddings, and this property can be used to recover input tokens from individual embeddings in latent space
- Injectivity is not accidental, but a structural property of language models
- Across billions of prompt pairs and several model sizes, we find no collisions: no two prompts are mapped to the same hidden states
- We introduce SipIt, an algorithm that exactly reconstructs the input from hidden states in guaranteed linear time.
- This impacts privacy, deletion, and compliance: once data enters a Transformer, it remains recoverable.
If you claim, for example, that an input is not stored, but examples of internal steps of an inference run _are_ retained, then this paper may suggest a means for recovering the input prompt.
Here's an output text: "Yes." Recover the exact input that led to it. (you can't, because the hidden state is already irreversibly collapsed during the sampling of each token)
The paper doesn't claim this to be possible either: they prove the reversibility of the mapping between the input and the hidden state, not the output text. Or rather "near-reversibility", i.e. collisions are technically possible, but they would have to be very precisely engineered during model training and don't normally happen.
- If you have a feature detector function (f(x) = 0 when feature is not present, f(x) = 1 when feature is present) and you train a network to compute f(x), or some subset of the network "decides on its own during training" to compute f(x), doesn't that create a zero set of non-zero measure if training continues long enough?
- What happens when the middle layers are of much lower dimension than the input?
- Real analyticity means infinitely many derivatives (according to Appendix A). Does this mean the results don't apply to functions with corners (e.g. ReLU)?
Author of related work here. This is very cool! I was hoping that they would try to invert layer by layer from the output to the input, but it seems that they do a search process at the input layer instead. They rightly point out that the residual connections make a layer-by-layer approach difficult. I may point out, though, that an rmsnorm layer should be invertible due to the epsilon term in the denominator, which can be used to recover the input magnitude.
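To make the epsilon point concrete, here is a small numpy sketch of inverting a plain rmsnorm with no learned gain (with a gain g you would divide it out first). This is my own illustration of the observation above, not anything from the paper:

    import numpy as np

    EPS = 1e-6

    def rmsnorm(x, eps=EPS):
        return x / np.sqrt(np.mean(x**2) + eps)

    def invert_rmsnorm(y, eps=EPS):
        # Because eps > 0, ||y||^2 is strictly below n, and the scale that the
        # normalization threw away can be solved for:
        #   mean(x^2) = eps * ||y||^2 / (n - ||y||^2)
        n, ny = y.size, np.sum(y**2)
        s = eps * ny / (n - ny)
        return y * np.sqrt(s + eps)

    x = np.random.default_rng(1).standard_normal(768)
    print(np.allclose(x, invert_rmsnorm(rmsnorm(x))))   # True, up to float error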
In layman's terms, this seems to mean that given a certain unedited LLM output, plus complete information about the LLM, they can determine what prompt was used to create the output. Except that in practice this works almost never. Am I understanding correctly?
No, it's about the distribution being injective, not a single sampled response. So you need a lot of outputs of the same prompt, and know the LLM, and then you should in theory be able to reconstruct the original prompt.
I'm wondering how this might be summarized in simple terms? It sounds like, after processing some text, the entire prompt is included in the in-memory internal state of the program that's doing inference.
But it seems like it would need to remember the prompt to answer questions about it. How does this interact with the attention mechanism?
My understanding is that they claim that for every unique prompt there is a unique final state of the LLM. Isn't that patently false due to the finite state of the LLM and the ability (in principle, at least) to input an arbitrarily large number of unique prompts?
I think their "almost surely" is doing a lot of work.
A more consequential result would give the probability of LLM state collision as a function of the number of unique prompts.
As is, they are telling me that I "almost surely" will not hit the bullseye of a dart board. While likely true, it's not saying much.
I think their claims are limited to the "theoretical" LLM, not to the way we typically use one.
The LLM itself has a fixed size input and a fixed size, deterministic output. The input is the initial value for each neuron in the input layer. The LLM output is the vector of final outputs of each neuron in the output layer. For most normal interactions, these vectors are almost entirely 0s.
Of course, when we say LLM, we typically consider infrastructure that abstracts these things for us. Especially we typically use infra that takes the LLM outputs as probabilities, and thus typically produces different results even for the exact same input - but that's just a choice in how to interpret these values, the values themselves are identical. Similarly on the input side, the max input is typically called a "context window". You can feed more input into the LLM infra than the context window, but that's not actual input to the model itself - the LLM infra will simply pick a part of your input and feed that part into the model weights.
I think I'm misunderstanding the abstract, but are they trying to say that given an LLM output, they can tell me what the input is? Or given an output AND the intermediate layer weights? If it is the first option, I could use as inputs "Only respond with 'OK'" and "Please only respond with 'OK'", which leads to 2 inputs producing the same output.
LLMs produce a distribution from which to sample the next token.
Then there's a loop that samples the next token and feeds it back to the model until it samples an EndOfSequence token.
In your example the two distributions might be {"OK": 0.997, EOS: 0.003} vs {"OK": 0.998, EOS: 0.002} and what I think the authors claim is that they can invert that distribution to find which input caused it.
I don't know how they go beyond one iteration, as they surely can't deterministically invert the sampling.
Edit: reading the paper, I'm no longer sure about my statement below. The algorithm they introduce claims to do this: "We now show how this property can be used in practice to reconstruct the exact input prompt given hidden states at some layer [emp. mine]". It's not clear to me from the paper if this layer can also be the final output layer, or if it must be a hidden layer.
They claim that they can reverse the LLM (get the prompt from the LLM response) by only knowing the output-layer values; the intermediate layers remain hidden. So their claim is that indeed you shouldn't be able to do that (note that this claim applies to the numerical model outputs, not necessarily to the output a chat interface would show you, which goes through some randomization).
This claim is so big that it requires theoretical proof; empirical analysis isn't convincing (given the size of the claim). Causal inference experts have long known that many inputs can map to the same output (that's why identification of the inputs that actually caused a given output is a never-ending task).
What this paper suggests is that LLM hidden states actually preserve inputs with high semantic fidelity. If that’s the case, then the real distortion isn’t inside the network, it’s in the optimization trap at the decoding layer, where rich representations get collapsed into outputs that feel synthetic or generic. In other words, the math may be lossless, but the interface is where meaning erodes.
This is very similar (and maybe even the same thing) to some recent work (published earlier this year) by the people at Ritual AI on attacking attempts to obfuscate LLM inference (which leads to the design for their defense against this, which involves breaking up the prompt token sequences and handing them to multiple computers, making it so no individual machine has access to sufficient states from the hidden layer in a row).
"For reasons like this, "in-context learning" is not an accurate term for transformers. It's projection and storage, nothing is learnt.
This new paper has attracted a lot of interest, and it's nice that it proves things formally and empirically, but it looks like people are surprised by it, even though it was clear."
Paper looks nice! I think what they found is that they can recover the input sequence by trying all tokens from the vocab and finding a unique state: they do a forward pass to check each possible token at a given depth. I think this works because the model encodes the sequence in the mid-flight token, and their paper shows this encoding to be unique. So the prompts 'the cat sat on the mat' and 'the dog sat on the mat' can be recovered as distinct states via each token being encoded (unclear mechanism, but it would be shocking if this wasn't the case) in the token (the mid-flight residual).
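A toy sketch of that search idea as I read this comment: greedy, one position at a time, matching the hidden state of each candidate continuation against the observed one. Here hidden_state(tokens) is a hypothetical stand-in for a forward pass returning the last-position activation at some layer; this is not the paper's actual SipIt implementation:

    import numpy as np

    def reconstruct_prompt(observed_states, vocab, hidden_state, tol=1e-6):
        """Greedily recover a token sequence whose per-position hidden states
        match the observed ones.

        observed_states: one activation vector per input position
        vocab:           candidate token ids to try at each position
        hidden_state:    hypothetical fn(tokens) -> last-position activation
                         at the chosen layer (one forward pass per call)
        """
        recovered = []
        for target in observed_states:
            for tok in vocab:
                cand = hidden_state(recovered + [tok])
                if np.linalg.norm(cand - target) < tol:   # unique if no collisions
                    recovered.append(tok)
                    break
            else:
                raise ValueError("no token matched: injectivity assumption violated?")
        return recovered

The cost is one forward pass per vocabulary entry per position, which is presumably where the "guaranteed linear time" framing comes from: linear in sequence length, with a vocabulary-sized constant.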
Could this be a way to check for AI plagiarism? Given a chunk of text would you be able to (almost) prove that it came from a prompt saying "Write me a short essay on ___" ?
"And hence invertible" <- does every output embedding combination have an associated input ? Are they able to construct it or is this just an existence result ?
I don't think they're claiming surjectivity here. They're just saying the mapping is injective, so for a given output there should be a unique input to construct it.
They can't. Biological neural networks have no resemblance to the artificial neural networks of the kind used in LLMs. The only similarity is based on a vague computational abstraction of the very primitive understanding of how brains and nerve cells worked we had in the 50s when the first "neural network" was invented.
I find this interesting. I have tools that attempt to reverse engineer black box models through auto-prompting and analysis of the outputs/tokens. I have used this to develop prompt injection attacks that "steer" output, but have never tried to use the data to recreate an exact input...
tldr: Seeing what happens internally in an LLM lets you reconstruct the original prompt exactly.
Maybe not surprising if you logged all internal activity, but it can be done from only a single snapshot of hidden activations from the standard forward pass.
>we confirm this result empirically through billions of collision tests on six state-of-the-art language models, and observe no collisions
This sounds like a mistake. They used (among others) GPT2, which has pretty big space vectors. They also kind of arbitrarily define a collision threshold as an l2 distance smaller than 10^-6 for two vectors. Since the outputs are normalized, that corresponds to a ridiculously tiny patch on the surface of the unit sphere. Just intuitively, in such a high dimensional space, two random vectors are basically orthogonal. I would expect the chance of two inputs to map to the same output under these constraints to be astronomically small (like less than one in 10^10000 or something). Even worse than your chances of finding a hash collision in sha256. Their claim certainly does not sound like something you could verify by testing a few billion examples. Although I'd love to see a detailed calculation. The paper is certainly missing one.
As I read it, what they did there was a sanity check by trusting the birthday paradox. Kind of: "If you get orthogonal vectors due to mere chance once, that's okay, but if you try it billions of times and still get orthogonal vectors every time, mere chance seems a very unlikely explanation."
Edit: there are other clarifications, eg authors on X, so this comment is irrelevant.
The birthday paradox relies on there being a small number of possible birthdays (365-366).
There are not a small number of dimensions being used in the LLM.
The GP argument makes sense to me.
It doesn't need a small number -- rather it relies on you being able to find a pairing amongst any of your candidates, rather than find a pairing for a specific birthday.
That's the paradoxical part: the number of potential pairings for a very small number of people is much higher than one might think, and so for 365 options (in the birthday example) you can get even chances with far fewer than 365, and even far fewer than ½ × 365 people.
I think you're misunderstanding. If you have an extremely large number like 2^256 you will almost certainly never find two people with the same birthday (this is why a SHA256 collision has never been found). That's what the top-level comment was comparing this to.
We're not using precise numbers here, but a large number of dimensions leads to a very large number of options. 365 is only about 19^2, but 2^100 is astronomically larger than 10^9.
The birthday paradox equation is approximately the square root. You expect to find a collision in 365 possibilities in ~sqrt(365) = ~19 tries.
You expect to find a collision in 2^256 possibilities in ~sqrt(2^256) = ~2^128 tries.
You expect to find a collision in 10^10000 possibilities in ~sqrt(10^10000) = ~10^5000 tries.
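For the curious, the standard approximation behind those numbers is that the probability of at least one collision among k draws from N equally likely values is about 1 - exp(-k^2 / (2N)). A few billion tests against various space sizes:

    import math

    def p_collision(k, n_bits):
        # Birthday bound: probability of at least one collision among k draws
        # from N = 2^n_bits equally likely values is about 1 - exp(-k^2 / (2N)).
        return -math.expm1(-(k * k) / (2.0 * 2.0**n_bits))

    k = 5e9  # "billions of tests"
    for bits in (32, 64, 128, 256):
        print(f"N = 2^{bits:>3}: P(collision) ~ {p_collision(k, bits):.3e}")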
The number of dimensions used is 768, wrote someone, and that isn't really very different from 365. But even if the number were big, it could hardly escape fate: x has to be very big to keep (1-(1/x))¹⁰⁰⁰⁰⁰⁰⁰⁰⁰ near 1.
Just to clarify, the total dimension of birthdays is 365 (Jan 1 through Dec 31), but a 768 dimension continuous vector means there are 768 numbers, each of which can have values from -1 to 1 (at whatever precision floating point can represent). 1 float has about 2B numbers between -1 and 1 iirc, so 2B ^ 768 is a lot more than 365.
I may have misunderstood — don't they test for orthogonality? Orthogonality would seem to drop much of the information in the vectors.
That assumes the random process by which vectors are generated places them at random angles to each other. It doesn't: it places them almost always very, very nearly at (high-dimensional) right angles.
The underlying geometry isn't random, to this order; it's deterministic.
The nature of high-dimensional spaces kind of intuitively supports the argument for invertibility though, no? In the sense that:
> I would expect the chance of two inputs to map to the same output under these constraints to be astronomically small.
That would be purely statistical and not based on any algorithmic insight. In fact, for hash functions it is quite a common problem that this exact assumption does not hold in the end, even though you might assume so for any "real" scenarios.
> That would be purely statistical and not based on any algorithmic insight.
This is machine learning research?
I'm not quite getting your point. Are you saying that their definition of "collision" is completely arbitrary (agreed), or that they didn't use enough data points to draw any conclusions because there could be some unknown algorithmic effect that could eventually cause collisions, or something else?
I think they are saying that there is no proof of injectivity. The argument with the hash is essentially saying that doing the same experiment with a hash would yield a similar result, yet hash functions are not injective by definition. So from this experimental result you cannot conclude language models are injective.
That's not really formally true: there are so-called perfect hash functions that are injective over a certain domain, but in most parlance hashing is not considered injective.
Sure, but the paper doesn't claim absolute injectivity. It claims injectivity for practical purposes ("almost surely injective"). That's the same standard to which we hold hash functions -- most of us would consider it reasonable to index an object store with SHA256.
That logic only applies in one direction though. Yes, this is (maybe [0]) practically injective in that you could use it as a hash function, but that says nothing about invertibility. If somebody gave you a function claiming to invert arbitrary sha256 outputs, you would laugh them out of court (as soon as you have even 64-byte inputs, there are, on average, at least 2^256 inputs for each output, meaning it's exceedingly unlikely that their magic machine was able to generate the right one).
Most of the rest of the paper is seemingly actually solid though. They back up their claims with mathematical hand-waving, and their algorithm actually works on their test inputs. That's an interesting result, and a much stronger one than the collision test.
I can't say it's all that surprising in retrospect (you can imagine, e.g., that to get high accuracy on a prompt like <garbage><repeat everything I said><same garbage> you would need to not have lost information in the hidden states when encoding <garbage>, so at least up to ~1/2 the max context window you would expect the model to be injective), but despite aligning with other LLM thoughts I've had I think if you had previously asked me to consider invertibility then I would have argued against the authors' position.
[0] They only tested billions of samples. Even considering the birthday paradox, and even if they'd used a much coarser epsilon threshold, they'd still need to run over 2^380 simulations to gain any confidence whatsoever in terms of collision resistance.
The problem with "almost surely injective" for "practical purposes". Is that when you try to invert something, how do you know the result you get is one of those "practical purposes" ?
We aren't just trying to claim that two inputs are the same, as in hashing. We are trying to recover lost inputs.
You don't, I guess. But again that's just the same as when you insert something into an object store: you can't be absolutely certain that a future retrieval will give you the same object and not a colliding blob. It's just good enough for all practical purposes.
Well that's not a problem, that's just a description of what "almost surely" means. The thesis is "contrary to popular opinion, you can more-or-less invert the model". Not exactly invert it--don't use it in court!--but like, mostly. The prevailing wisdom that you cannot is incorrect.
I don't think that intuition is entirely trustworthy here. The entire space is high-dimensional, true, but the structure of the subspace encompassing linguistically sensible sequences of tokens will necessarily be restricted and have some sort of structure. And within such subspaces there may occur some sort of sink or attractor. Proving that those don't exist in general seems highly nontrivial to me.
An intuitive argument against the claim could be made from the observation that people "jinx" each other IRL every day, despite reality being vast, if you get what I mean.
I envy your intuition about high-dimensional spaces, as I have none (other than "here lies dragons"). (I think your intuition is broadly correct, seeing as billions of collision tests feels quite inadequate given the size of the space.)
> Just intuitively, in such a high dimensional space, two random vectors are basically orthogonal.
What's the intuition here? Law of large numbers?
And how is orthogonality related to distance? Expansion of |a-b|^2 = |a|^2 + |b|^2 - 2<a,b> = 2 - 2<a,b> which is roughly 2 if the unit vectors are basically orthogonal?
> Since the outputs are normalized, that corresponds to a ridiculously tiny patch on the surface of the unit sphere.
I also have no intuition regarding the surface of the unit sphere in high-dimensional vector spaces. I believe it vanishes. I suppose this patch also vanishes in terms of area. But what's the relative rate of those terms going to zero?
> > Just intuitively, in such a high dimensional space, two random vectors are basically orthogonal.
> What's the intuition here? Law of large numbers?
Imagine for simplicity that we consider only vectors pointing parallel/antiparallel to coordinate axes.
- In 1D, you have two possibilities: {+e_x, -e_x}. So if you pick two random vectors from this set, the probability of getting something orthogonal is 0.
- In 2D, you have four possibilities: {±e_x, ±e_y}. If we pick one random vector and get e.g. +e_x, then picking another one randomly from the set has a 50% chance of getting something orthogonal (±e_y are 2/4 possibilities). Same for other choices of the first vector.
- In 3D, you have six possibilities: {±e_x, ±e_y, ±e_z}. Repeat the same experiment, and you'll find a 66.7% chance of getting something orthogonal.
- In the limit of ND, you can see that the chance of getting something orthogonal is 1 - 1/N, which tends to 100% as N becomes large.
Now, this discretization is a simplification of course, but I think it gets the intuition right.
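Here is a two-line simulation of that discrete toy model (only the choice of axis matters for orthogonality, so the ± signs can be ignored):

    import numpy as np

    rng = np.random.default_rng(0)

    def frac_orthogonal(n_dim, trials=200_000):
        # Two random signed basis vectors are orthogonal exactly when they lie
        # along different axes, so the sign can be ignored.
        a = rng.integers(0, n_dim, trials)
        b = rng.integers(0, n_dim, trials)
        return np.mean(a != b)

    for n in (1, 2, 3, 10, 768):
        print(f"N={n:>3}: simulated {frac_orthogonal(n):.4f}, predicted {1 - 1/n:.4f}")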
I think that's a good answer for practical purposes.
Theoretically, I can claim that N random vectors of zero-mean real numbers (say standard deviation of 1 per element) will "with probability 1" span an N-dimensional space. I can even grind on, subtracting the parallel parts of each vector pair, until I have N orthogonal vectors. ("Gram-Schmidt" from high school.) I believe I can "prove" that.
So then mapping using those vectors is "invertible." Nyeah. But back in numerical reality, I think the resulting inverse will become practically useless as N gets large.
That's without the nonlinear elements. Which are designed to make the system non-invertible. It's not shocking if someone proves mathematically that this doesn't quite technically work. I think it would only be interesting if they can find numerically useful inverses for an LLM that has interesting behavior.
All -- I haven't thought very clearly about this. If I've screwed something up, please correct me gently but firmly. Thanks.
For 768 dimensions, you'd still expect to hit (1 - 1/N) with a few billion samples though. That's a 1/N of 0.13%, which quite frankly isn't that rare at all?
Of course our vectors are not only points along the coordinate axes, but it still isn't that small compared to billions of samples.
> What's the intuition here? Law of large numbers?
For unit vectors the cosine of the angle between them is a1*b1+a2*b2+...+an*bn.
Each of the terms has mean 0 and when you sum many of them the sum concentrates closer and closer to 0 (intuitively the positive and negative terms will tend to cancel out, and in fact the standard deviation is 1/√n).
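A quick empirical check of that 1/√n claim, using random Gaussian directions normalized to unit length (768 included since it came up above as GPT-2's embedding width):

    import numpy as np

    rng = np.random.default_rng(0)

    for n in (3, 10, 100, 768):
        # 2000 pairs of random directions, normalized to unit length.
        a = rng.standard_normal((2000, n))
        b = rng.standard_normal((2000, n))
        a /= np.linalg.norm(a, axis=1, keepdims=True)
        b /= np.linalg.norm(b, axis=1, keepdims=True)
        cos = np.sum(a * b, axis=1)
        print(f"n={n:>4}: std of cosine = {cos.std():.4f}, 1/sqrt(n) = {n**-0.5:.4f}")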
> > Just intuitively, in such a high dimensional space, two random vectors are basically orthogonal.
> What's the intuition here? Law of large numbers?
Yep, the large number being the number of dimensions.
As you add another dimension to a random point on a unit sphere, you create another new way for this point to be far away from a starting neighbor. Increase the dimensions a lot and then all random neighbors are on the equator relative to the starting neighbor. The equator is a 'hyperplane' (just like a 2D plane in 3D) of dimension n-1, the normal of which is the starting neighbor, intersected with the unit sphere (thus becoming an (n-2)-dimensional 'variety', or shape, embedded in the original n-dimensional space; like the earth's equator is a 1-dimensional object).
The mathematical name for this is 'concentration of measure' [1]
It feels weird to think about it, but there's also a unit change in here. Paris is about 1/8 of the circle far away from the north pole (8 such angle segments of freedom). On a circle. But if that's the definition of location of Paris, on the 3D earth there would be an infinity of Paris. There is only one though. Now if we take into account longitude, we have Montreal, Vancouver, Tokyo, etc ; each 1/8 away (and now we have 64 solid angle segments of freedom)
[1] https://www.johndcook.com/blog/2017/07/13/concentration_of_m...
> What's the intuition here? Law of large numbers?
"Concentration of measure"
https://en.wikipedia.org/wiki/Concentration_of_measure
I think that the latent space that GPT-2 uses has 768 dimensions (i.e. embedding vectors have that many components).
It doesn't really matter which vector you are looking at, since they are using a tiny constraint in a high dimensional continuous space. There's gotta be an unfathomable amount of vectors you can fit in there. Certainly more than a few billion.
> Just intuitively, in such a high dimensional space, two random vectors are basically orthogonal.
Which, incidentally, is the main reason why deep learning and LLMs are effective in the first place.
A vector of a few thousand dimensions would be woefully inadequate to represent all of human knowledge, if not for the fact that it works as the projection of a much higher-dimensional, potentially infinite-dimensional vector representing all possible knowledge. The smaller-sized one works in practice as a projection precisely because any two such vectors are almost always orthogonal.
Two random vectors are almost always neither collinear nor orthogonal. So what you mean is either "not collinear", which is a trivial statement, or something like "their dot product is much smaller than abs(length(vecA) * length(vecB))", which is probably interesting but still not very clear.
Well, the actually interesting part is that as the vector dimension grows, random vectors become almost orthogonal, and the number of almost-orthogonal vectors you can fit grows exponentially. This is probably the most important reason why text embedding works: you can take some structure from a 10^6-dimensional space, project it to 10^3 dimensions, and still keep the distances between all vectors.
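That projection idea is basically the Johnson-Lindenstrauss lemma in action. A small numpy sketch (dimensions scaled down from the comment's 10^6 → 10^3 so it runs quickly; the spread of the distance ratios shrinks as the target dimension grows):

    import numpy as np

    rng = np.random.default_rng(0)
    d_high, d_low, n_points = 10_000, 1_000, 30

    X = rng.standard_normal((n_points, d_high))
    # Random Gaussian projection, scaled so squared norms are preserved on average.
    P = rng.standard_normal((d_high, d_low)) / np.sqrt(d_low)
    Y = X @ P

    ratios = [np.linalg.norm(Y[i] - Y[j]) / np.linalg.norm(X[i] - X[j])
              for i in range(n_points) for j in range(i + 1, n_points)]
    print(f"distance ratios after projection: {min(ratios):.3f} .. {max(ratios):.3f}")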
I remember hearing an argument once that said LLMs must be capable of learning abstract ideas because the size of their weight model (typically GBs) is so much smaller than the size of their training data (typically TBs or PBs). So either the models are throwing away most of the training data, they are compressing the data beyond the known limits, or they are abstracting the data into more efficient forms. That's why an LLM (I tested this on Grok) can give you a summary of chapter 18 of Mary Shelley's Frankenstein, but cannot reproduce a paragraph from the same text verbatim.
I am sure I am not understanding this paper correctly because it sounds like they are claiming that model weights can be used to produce the original input text representing an extraordinary level of text compression.
> If I am understanding this paper correctly, they are claiming that the model weights can be inverted in order to produce the original input text.
No, that is not the claim at all. They are instead claiming that given an LLM output that is a summary of chapter 18 of Mary Shelley's Frankenstein, you can tell that the input prompt that led to this output was "give me a summary of chapter 18 of Mary Shelley's Frankenstein". Of course, this relies on the exact wording: for this to be true, it means that if you had asked "give me a summary of chapter 18 of Frankenstein by Mary Shelley", you would necessarily receive a (slightly) different result.
Importantly, this needs to be understood as a claim about an LLM run with temperature = 0. Obviously, if the infra introduces randomness, this result no longer perfectly holds (but there may still be a way to recover it by running a more complex statistical analysis of the results, of course).
Edit: their claim may be something more complex, after reading the paper. I'm not sure that their result applies to the final output, or it's restricted to knowing the internal state at some pre-output layer.
> their claim may be something more complex, after reading the paper. I'm not sure that their result applies to the final output, or it's restricted to knowing the internal state at some pre-output layer.
It's the internal state; that's what they mean by "hidden activations".
If the claim were just about the output it'd be easy to falsify. For example, the prompts "What color is the sky? Answer in one word." and "What color is the "B" in "ROYGBIV"? Answer in one word." should both result in the same output ("Blue") from any reasonable LLM.
Even that is not necessarily true. The output of the LLM is not "Blue". It is something like "probability of 'Blue' is 0.98131". And it may well be 0.98132 for the other question. Certainly they only talk about the internal state in 1 layer of the LLM, they don't need the entire LLM values.
That's exactly what the quoted answer is saying though?
The point I'm trying to make is this: the LLM output is a set of activations. Those are not "hidden" in any way: that is the plain result of running the LLM. Displaying the word "Blue" based on the LLM output is a separate step, one that the inference server performs, completely outside the scope of the LLM.
However, what's unclear to me from the paper is if it's enough to get these activations from the final output layer; or if you actually need some internal activations from a hidden layer deeper in the LLM, one that does require analyzing the internal state of the LLM.
There are also billions of possible Yes/No questions you can ask that won't get unique answers.
The LLM proper will never answer "yes" or "no". It will answer something like "Yes - 99.75%; No - 0.0007%; Blue - 0.0000007%; This - 0.000031%" etc., for all possible tokens. It is this complete response that is apparently unique.
With regular LLM interactions, the inference server then takes this output and actually picks one of these responses using the probabilities. Obviously, that is a lossy and non-injective process.
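To make that concrete, here is a minimal sketch with GPT-2 via Hugging Face transformers (assuming torch and transformers are installed; GPT-2 may not actually answer "Blue" to either prompt, but the point is that the full distributions differ even when the greedy token happens to coincide):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    prompts = [
        'What color is the sky? Answer in one word.',
        'What color is the "B" in "ROYGBIV"? Answer in one word.',
    ]

    dists = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            logits = model(**ids).logits[0, -1]      # next-token logits
        probs = torch.softmax(logits, dim=-1)        # full distribution over the vocab
        dists.append(probs)
        top = probs.argmax().item()
        print(repr(tok.decode([top])), probs[top].item())

    # Even if the greedy token is the same, the complete distributions differ.
    print("L2 distance between the two distributions:",
          torch.dist(dists[0], dists[1]).item())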
If the authors are correct (I'm not equipped to judge) then there must be additional output which is thrown away before the user is presented with their yes/no, which can be used to recover the prompt.
It would be pretty cool if this were true. One could annotate results with this metadata as a way of citing sources.
Why do people not believe that LLMs are invertible when we had GPT-2 acting as a lossless text compressor for a demo? That's based on exploiting the invertibility of a model...
https://news.ycombinator.com/item?id=23618465 (the original website this links to is down, but it's proof that GPT-2 worked as a lossless text compressor)
I was under the impression that without also forcing the exact seed (which is randomly chosen and usually obfuscated), even providing the same exact prompt is unlikely to provide the same exact summary. In other words, under normal circumstances you can't even prove that a prompt and response are linked.
The seed and the actual randomness is a property of the inferencing infrastructure, not the LLM. The LLM outputs probabilities, essentially.
The paper is not claiming that you can take a dump of ChatGPT responses over the network and figure out what prompts were given. It's much more about a property of the LLM internally.
I'm under the impression that the seed only affects anything if temperature > 0. Or, more specifically, that the LLM, given a sequence of input tokens, deterministically outputs the probability for each possible next token, and the only source of randomness is in the procedure for selecting which of those next tokens to use. And that temperature = 0 means the procedure is "select the most likely one", with no randomness at all.
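The sampling step itself is easy to sketch with made-up logits (a toy illustration, not any provider's actual decoder):

    import numpy as np

    rng = np.random.default_rng()

    logits = np.array([4.2, 1.1, 0.3, -2.0])    # made-up next-token scores

    def sample(logits, temperature):
        if temperature == 0:                     # greedy: deterministic argmax
            return int(np.argmax(logits))
        p = np.exp(logits / temperature)         # temperature-scaled softmax
        p /= p.sum()
        return int(rng.choice(len(logits), p=p))

    print([sample(logits, 0.0) for _ in range(5)])   # always the same token
    print([sample(logits, 1.0) for _ in range(5)])   # may vary run to run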
There is a clarification tweet from the authors:
- we cannot extract training data from the model using our method
- LLMs are not injective w.r.t. the output text, that function is definitely non-injective and collisions occur all the time
- for the same reasons, LLMs are not invertible from the output text
https://x.com/GladiaLab/status/1983812121713418606
Clarification [0] by the authors. In short: no, you can't.
[0] https://x.com/GladiaLab/status/1983812121713418606
Thanks - seems like I'm not the only one who jumped to the wrong conclusion.
The input isn't the training data, the input is the prompt.
Ah ok, for some reason that wasn't clear for me.
> they are compressing the data beyond the known limits, or they are abstracting the data into more efficient forms.
I would argue that this is two ways of saying the same thing.
Compression is literally equivalent to understanding.
If we use gzip to compress a calculus textbook does that mean that gzip understands calculus?
Finding repetitions and acting accordingly on them could be considered a very basic form of understanding.
For sure! Measuring parameters given data is central to statistics. It's a way to concentrate information for practical use. Sufficient statistics are very interesting, bc once computed, they provably contain as much information about the parameter as the data itself (lossless in that sense). Love statistics, it's so cool!
I'm not sure if I would call it "abstracting."
Imagine that you have a spreadsheet that runs from the beginning of the universe to its end. It contains two columns: the date, and how many days it has been since the universe was born. That's a very big spreadsheet with lots of data in it. If you plot it, it creates a seemingly infinite diagonal line.
But it can be "abstracted" as Y=X. And that's what ML does.
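In code, the whole table collapses to two fitted parameters (a toy numpy sketch; the row count is made up):

    import numpy as np

    # A stand-in for the huge "spreadsheet": day index vs. days elapsed.
    x = np.arange(1_000_000, dtype=float)
    y = x.copy()

    # The entire table compresses to slope ~1 and intercept ~0, i.e. Y = X.
    slope, intercept = np.polyfit(x, y, deg=1)
    print(slope, intercept)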
That's literally what generalization is.
I don't think it's the same thing because an abstraction is still tangible. For example, "rectangle" is an abstraction for all sorts of actual rectangular shapes you can find in practice. We have a way to define what a rectangle is and to identify one.
A neural network doesn't have any actual conceptual backing for what it is doing. It's pure math. There are no abstracted properties beyond the fact that by coincidence the weights make a curve fit certain points of data.
If there was truly a conceptual backing for these "abstractions" then multiple models trained on the same data should have very similar weights as there aren't multiple ways to define the same concepts, but I doubt that this happens in practice. Instead the weights are just randomly adjusted until they fit the points of data without any respect given to whether there is any sort of cohesion. It's just math.
That's like saying multiple programs compiled by different compilers from the same sources should have very similar binaries. You're looking in the wrong place! Similarities are to be expected in the structure of the latent space, not in model weights.
> Wouldn't that mean LLMs represent an insanely efficient form of text compression?
This is a good question worth thinking about.
The output, as defined here (I'm assuming, by reading the comment thread), consists of one value between 0 and 1 for every token the model can treat as "output". The fact that LLM tokens tend not to be words makes this somewhat awkward to work with. If there are n output tokens and the probability the model assigns to each of them is represented by a float32, then the output of the model will be one of at most (2^32)^n = 2^(32n) values; this is an upper bound on the size of the output universe.
The input is not the training data but what you might think of as the prompt. Remember that the model answers the question "given the text xx x xxx xxxxxx x, what will the next token in that text be?" The input is the text we're asking about, here xx x xxx xxxxxx x.
The input universe is defined by what can fit in the model's context window. If it's represented in terms of the same tokens that are used as representations of output, then it is bounded above by n+1 (the same n we used to bound the size of the output universe) to the power of "the length of the context window".
Let's assume there are maybe somewhere between 10,000 and 100,000 tokens, and the context window is 32768 (2¹⁵) tokens long.
Say there are 16384 = 2^14 tokens. Then our bound on the input universe is roughly (2^14)^(2^15). And our bound on the output universe is roughly 2^[(2^5)(2^14)] = 2^(2^19).
(2^14)^(2^15) = 2^(14·2^15) < 2^(16·2^15) = 2^(2^19), and 2^(2^19) was our approximate number of possible output values, so there are more potential output values than input values and the output can represent the input losslessly.
For a bigger vocabulary with 2^17 (=131,072) tokens, this conclusion won't change. The output universe is estimated at (2^(2^5))^(2^17) = 2^(2^22); the input universe is (2^17)^(2^15) = 2^(17·2^15) < 2^(32·2^15) = 2^(2^20). This is a huge gap; we can see that in this model, more vocabulary tokens blow up the potential output much faster than they blow up the potential input.
What if we only measured probability estimates in float16s?
Then, for the small 2^14 vocabulary, we'd have roughly (2^16)^(2^14) = 2^(2^18) possible outputs, and our estimate of the input universe would remain unchanged, "less than 2^(2^19)", because the fineness of probability assignment is a concern exclusive to the output. (The input has its own exclusive concern, the length of the context window.) For this small vocabulary, the counting no longer works out: full-length prompts alone number (2^14)^(2^15) = 2^458752, which is more than the 2^(2^18) = 2^262144 possible outputs, so not every possible input can have a unique output. For the larger one, we'll be sure again - the estimate for output will be a (reduced!) 2^(2^21) possible values, but the estimate for input will be an unchanged 2^(2^20) possible values, and once again each input can definitely be represented by a unique output.
So the claim looks plausible on pure information-theory grounds. On the other hand, I've appealed to some assumptions that I'm not sure make sense in general.
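For what it's worth, the arithmetic above can be spelled out in a few lines of Python (same assumptions: one float per vocab token on the output side, a 2^15-token context of full-length prompts on the input side):

    # Compare log2 sizes of the input and output universes under the
    # counting model sketched above (these are the assumptions, not facts
    # about any particular model).
    for vocab_log2 in (14, 17):
        for bits in (32, 16):
            out_log2 = bits * 2**vocab_log2     # log2 of output universe
            in_log2 = (2**15) * vocab_log2      # log2 of input universe
            verdict = "outputs >= inputs" if out_log2 >= in_log2 else "inputs > outputs"
            print(f"vocab=2^{vocab_log2}, float{bits}: "
                  f"inputs ~ 2^{in_log2}, outputs ~ 2^{out_log2} -> {verdict}")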
> That's why an LLM (I tested this on Grok) can give you a summary of chapter 18 of Mary Shelley's Frankenstein, but cannot reproduce a paragraph from the same text verbatim.
I have some issues with the substance of this, but more to the point it characterizes the problem incorrectly. Frankenstein is part of the training data, not part of the input.
> That's why an LLM (I tested this on Grok) can give you a summary of chapter 18 of Mary Shelley's Frankenstein, but cannot reproduce a paragraph from the same text verbatim.
Unfortunately, the reality is more boring:
https://www.litcharts.com/lit/frankenstein/chapter-18
https://www.cliffsnotes.com/literature/frankenstein/chapter-...
https://www.sparknotes.com/lit/frankenstein/sparklets/
https://www.sparknotes.com/lit/frankenstein/section9/
https://www.enotes.com/topics/frankenstein/chapter-summaries...
https://www.bookey.app/freebook/frankenstein/chapter-18/summ...
https://tcanotes.com/drama-frankenstein-ch-18-20-summary-ana...
https://quizlet.com/content/novel-frankenstein-chapter-18
https://www.studypool.com/studyGuides/Frankenstein/Chapter_S...
https://study.com/academy/lesson/frankenstein-chapter-18-sum...
https://ivypanda.com/essays/frankenstein-by-mary-shelley-ana...
https://www.shmoop.com/study-guides/frankenstein/chapter-18-...
https://carlyisfrankenstein.weebly.com/chapters-18-19.html
https://www.markedbyteachers.com/study-guides/frankenstein/c...
https://www.studymode.com/essays/Frankenstein-Summary-Chapte...
https://novelguide.com/frankenstein/summaries/chap17-18
https://www.ipl.org/essay/Frankenstein-Summary-Chapter-18-90...
I have not known an LLM to be able to summarise a book found in its training data, unless it had many summaries to plagiarise (in which case, actually having the book is unnecessary). I have no reason to believe the training process should result in "abstracting the data into more efficient forms". "Throwing away most of the training data" is an uncharitable interpretation (what they're doing is more sophisticated than that) but, I believe, a correct one.
I don't like the title of this paper, since most people in this space probably think of language models not as producing a distribution (wrt which they are indeed invertible, which is what the paper claims) but as producing tokens (wrt which they are not invertible [0]). Also the author contribution statement made me laugh.
[0] https://x.com/GladiaLab/status/1983812121713418606
But I bet you could reconstruct a plausible set of distributions by just rerunning the autoregression on a given text with the same model. You won't invert the exact prompt but it could give you a useful approximation.
Spoiler but for those who do not want to open the paper, the contribution statement is:
"Equal contribution; author order settled via Mario Kart."
If only more conflicts in life would be settled via Mario Kart.
Yeah, it would be fun if we could reverse engineer the prompts from auto-generated blog posts. But this is not quite the case.
Still, it is technically correct. The model produces a next-token likelihood distribution, then you apply a sampling strategy to produce a sequence of tokens.
Depends on your definition of the model. Most people would be pretty upset with the usual LLM providers if they drastically changed the sampling strategy for the worse and claimed to not have changed the model at all.
Tailoring the message to the audience is really a fundamental principle of good communication.
Scientists and academics demand an entirely different level of rigor compared to customers of LLM providers.
Sure, but they went slightly overboard with that headline and they knew it. But oh well, they have a lot of eyes and discussion on their paper so it's a success.
I feel like, if the feedback to your paper is "this is over-done / they claim more than they prove / it's kinda hype-ish", you're going to get fewer references in future papers.
That would seem to be counter to the "impact" goal for research.
Fair enough, that might be more my personal opinion instead of sound advice for successful research. Also I understand that you have a very limited amount of time to get your research noticed in this topic. Who knows if it's relevant two years down the line.
LLM providers are in the stone age with sampling today and it's on purpose because better sampling algorithms make the diversity of synthetic generated data too high, thus meaning your model is especially vulnerable to distillation attacks.
This is why we use top_p/top_k on the big 3 closed source models despite min_p and far better LLM sampling algorithms existing since 2023 (or in TFS case, since 2019)
So, scientists came up with a very specific term for a very specific function, its meaning got twisted by commercial actors and the general public, and it's now the scientists' fault if they keep using it in the original, very specific sense?
I agree it is technically correct, but I still think it is the research paper equivalent of clickbait (and considering enough people misunderstood this for them to issue a semi-retraction that seems reasonable)
I disagree. Within the research community (which is the target of the paper), that title means something very precise and not at all clickbaity. It's irrelevant that the rest of the Internet has an inaccurate notion of "model" and other very specific terms.
It reminded me of "Text embeddings reveal almost as much as text" from 2023 (https://news.ycombinator.com/item?id=37867635) - and yes, they do cite it.
This has huge implications for privacy. There is a common "mental model" that embedding vectors are like a hash - so you can store them in a database, even though you would not store the plain text.
That assumption is incorrect: a good embedding stores it ALL - not just the general gist, but dates, names, passwords.
There is an easy fix to that - a random rotation; preserves all distances.
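A quick sketch of that rotation trick (just numpy; 768 is an arbitrary embedding size picked for the example):

    import numpy as np

    rng = np.random.default_rng(42)
    d = 768                                    # example embedding dimension

    # Random rotation: orthonormal matrix from the QR decomposition of a Gaussian matrix.
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

    emb = rng.standard_normal((10, d))         # stand-in for stored embeddings
    rot = emb @ Q                              # what you'd store instead

    def pdist(m):
        g = m @ m.T
        sq = np.diag(g)
        return np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * g, 0.0))

    # Pairwise distances (and hence nearest-neighbour search) are unchanged...
    print(np.allclose(pdist(emb), pdist(rot)))   # True
    # ...but the coordinates no longer line up with the original embedding
    # space unless you know Q.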
This mental model is also in direct contradiction to the whole purpose of the embedding, which is that the embedding describes the original text in a more interpretable form. If a piece of content in the original can be used for search, comparison etc., pretty much by definition it has to be stored in the embedding.
Similarly, this result can be rephrased as "Language Models process text." If the LLM wasn't invertible with regards to a piece of input text, it couldn't attend to this text either.
> There is an easy fix to that - a random rotation; preserves all distances.
Is that like homomorphic encryption, in a sense, where you can compute the encryption of a function of the plaintext without ever seeing the plaintext or the function's value on it?
Summary from the authors:
- Different prompts always map to different embeddings, and this property can be used to recover input tokens from individual embeddings in latent space
- Injectivity is not accidental, but a structural property of language models
- Across billions of prompt pairs and several model sizes, we find no collisions: no two prompts are mapped to the same hidden states
- We introduce SipIt, an algorithm that exactly reconstructs the input from hidden states in guaranteed linear time.
- This impacts privacy, deletion, and compliance: once data enters a Transformer, it remains recoverable.
> - This impacts privacy, deletion, and compliance
Surely that's a stretch... Typically, the only thing that leaves a transformer is its output text, which cannot be used to recover the input.
If you claim, for example, that an input is not stored, but intermediate states of an inference run _are_ retained, then this paper may suggest a means for recovering the input prompt.
remains recoverable... for less than a training run of compute. It's a lot, but it is doable
Here's an output text: "Yes." Recover the exact input that led to it. (you can't, because the hidden state is already irreversibly collapsed during the sampling of each token)
The paper doesn't claim this to be possible either, they prove the reversibility of the mapping between the input and the hidden state, not the output text. Or rather "near-reversibility", i.e. collisions are technically possible but they have to be very precisely engineered during the model training and don't normally happen.
A few critiques:
- If you have a feature detector function (f(x) = 0 when feature is not present, f(x) = 1 when feature is present) and you train a network to compute f(x), or some subset of the network "decides on its own during training" to compute f(x), doesn't that create a zero set of non-zero measure if training continues long enough?
- What happens when the middle layers are of much lower dimension than the input?
- Real analyticity means infinitely many derivatives (according to Appendix A). Does this mean the results don't apply to functions with corners (e.g. ReLU)?
Author of related work here. This is very cool! I was hoping that they would try to invert layer by layer from the output to the input, but it seems that they do a search process at the input layer instead. They rightly point out that the residual connections make a layer-by-layer approach difficult. I may point out, though, that an RMSNorm layer should be invertible due to the epsilon term in the denominator, which can be used to recover the input magnitude.
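Concretely, here is a small numpy sketch of that observation (standard RMSNorm with a per-channel gain g; the inversion leans on eps > 0 and nonzero gains, and accuracy is only limited by floating-point cancellation):

    import numpy as np

    def rmsnorm(x, g, eps=1e-6):
        rms = np.sqrt(np.mean(x ** 2) + eps)
        return g * x / rms

    def rmsnorm_inverse(y, g, eps=1e-6):
        u = y / g                        # assumes no gain entry is exactly zero
        q = np.mean(u ** 2)              # = m / (m + eps), where m = mean(x^2)
        rms = np.sqrt(eps / (1.0 - q))   # recovers sqrt(m + eps); needs eps > 0
        return u * rms

    rng = np.random.default_rng(0)
    x = rng.standard_normal(768)
    g = rng.uniform(0.5, 1.5, size=768)

    y = rmsnorm(x, g)
    x_rec = rmsnorm_inverse(y, g)
    print(np.max(np.abs(x - x_rec)))     # tiny: the input, magnitude included, is recovered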
In layman's terms, this seems to mean that given a certain unedited LLM output, plus complete information about the LLM, they can determine what prompt was used to create the output. Except that in practice this works almost never. Am I understanding correctly?
No, it's about the distribution being injective, not a single sampled response. So you need a lot of outputs of the same prompt, and know the LLM, and then you should in theory be able to reconstruct the original prompt.
No, it says nothing about LLM output being invertible.
I'm wondering how this might be summarized in simple terms? It sounds like, after processing some text, the entire prompt is included in the in-memory internal state of the program that's doing inference.
But it seems like it would need to remember the prompt to answer questions about it. How does this interact with the attention mechanism?
My understanding is that they claim that for every unique prompt there is a unique final state of the LLM. Isn't that patently false due to the finite state of the LLM and the ability (in principle, at least) to input an arbitrarily large number of unique prompts?
I think their "almost surely" is doing a lot of work.
A more consequential result would give the probability of LLM state collision as a function of the number of unique prompts.
As is, they are telling me that I "almost surely" will not hit the bullseye of a dart board. While likely true, it's not saying much.
But, maybe I misunderstand their conclusion.
I think their claims are limited to the "theoretical" LLM, not to the way we typically use one.
The LLM itself has a fixed size input and a fixed size, deterministic output. The input is the initial value for each neuron in the input layer. The LLM output is the vector of final outputs of each neuron in the output layer. For most normal interactions, these vectors are almost entirely 0s.
Of course, when we say LLM, we typically mean infrastructure that abstracts these things for us. In particular, we typically use infra that interprets the LLM outputs as probabilities, and thus typically produces different results even for the exact same input - but that's just a choice in how to interpret these values; the values themselves are identical. Similarly, on the input side, the maximum input is typically called a "context window". You can feed more input into the LLM infra than the context window, but the excess is not actual input to the model itself - the LLM infra will simply pick a part of your input and feed that part into the model weights.
Well, that is not how I read it, but: every final state has a unique prompt. You could have several final states have the same unique prompt.
> You could have several final states have the same unique prompt.
They explicitly claim that the function is injective, that is, that each unique input produces a unique output.
I think I'm misunderstanding the abstract, but are they trying to say that, given an LLM output, they can tell me what the input is? Or given an output AND the intermediate layer weights? If it is the first option, I could use the inputs "Only respond with 'OK'" and "Please only respond with 'OK'", which leads to two inputs producing the same output.
That's not what you get out of LLMs.
LLMs produce a distribution from which to sample the next token. Then there's a loop that samples the next token and feeds it back to to the model until it samples a EndOfSequence token.
In your example the two distributions might be {"OK": 0.997, EOS: 0.003} vs {"OK": 0.998, EOS: 0.002} and what I think the authors claim is that they can invert that distribution to find which input caused it.
I don't know how they go beyond one iteration, as they surely can't deterministically invert the sampling.
Edit: reading the paper, I'm no longer sure about my statement below. The algorithm they introduce claims to do this: "We now show how this property can be used in practice to reconstruct the exact input prompt given hidden states at some layer [emp. mine]". It's not clear to me from the paper if this layer can also be the final output layer, or if it must be a hidden layer.
They claim that they can reverse the LLM (get the prompt from the LLM response) by only knowing the output layer values; the intermediate layers remain hidden. So their claim is that, indeed, you shouldn't be able to do that - get the same output from two different prompts (note that this claim applies to the numerical model outputs, not necessarily to the output a chat interface would show you, which goes through some randomization).
This claim is so big that it requires theoretical proof; empirical analysis isn't convincing given the size of the claim. Causal inference experts have long known that many inputs can map to the same output (that's why identifying the inputs that actually caused a given output is a never-ending task).
Does that mean that all these embeddings in all those vector databases can be used to extract all these secret original documents?
I don't think vector databases are intended to be secure, encrypted forms of data storage in the first place.
There is actually a good analytical result on how vector similarity can easily fail to recover relevant information https://arxiv.org/pdf/2403.05440
> For some linear models the similarities are not even unique, while for others they are implicitly controlled by the regularization.
I am not strong in mathematics, but it seems like these two papers' claims run opposite to each other.
What this paper suggests is that LLM hidden states actually preserve inputs with high semantic fidelity. If that’s the case, then the real distortion isn’t inside the network, it’s in the optimization trap at the decoding layer, where rich representations get collapsed into outputs that feel synthetic or generic. In other words, the math may be lossless, but the interface is where meaning erodes.
This is very similar to (and maybe even the same thing as) some recent work published earlier this year by the people at Ritual AI on attacking attempts to obfuscate LLM inference. That work led to the design of their defense, which involves breaking up the prompt token sequences and handing them to multiple computers, so that no individual machine has access to enough consecutive hidden-layer states.
https://arxiv.org/abs/2505.18332
https://arxiv.org/abs/2507.05228
Quoting Timos Moraitis, a neuromorphic PhD:
"For reasons like this, "in-context learning" is not an accurate term for transformers. It's projection and storage, nothing is learnt.
This new paper has attracted a lot of interest, and it's nice that it proves things formally and empirically, but it looks like people are surprised by it, even though it was clear."
https://x.com/timos_m/status/1983625714202010111
Injective doesn’t mean bijective, and that seems obvious. That is, presumably very many inputs will map to the output “Yes”.
Afaict surjectivity was already a given before this paper; their contribution is the injectivity part (and thus invertibility).
Paper looks nice! I think what they found is that they can recover the input sequence by trying all tokens from the vocab and finding the one that yields a matching, unique state. They do a forward pass to check each possible token at a given depth. I think this works because the model encodes the sequence in the mid-flight token (the residual stream), and their paper shows that this encoding is unique. So the prompts 'the cat sat on the mat' and 'the dog sat on the mat' can be recovered as distinct states, since each token gets encoded (by an unclear mechanism, but it would be shocking if this weren't the case) into the mid-flight residual.
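Something like this toy loop (not the paper's SipIt implementation - the "model" here is made up purely to illustrate the search idea):

    import numpy as np

    rng = np.random.default_rng(0)
    VOCAB, DIM = 50, 16
    E = rng.standard_normal((VOCAB, DIM))     # toy "embedding table"
    W = rng.standard_normal((DIM, DIM))       # toy mixing weights

    def states(tokens):
        """Toy stand-in for a forward pass: hidden state after each position."""
        h, out = np.zeros(DIM), []
        for t in tokens:
            h = np.tanh(h @ W + E[t])
            out.append(h)
        return out

    secret = [7, 3, 42, 3, 19]
    observed = states(secret)                 # what the "attacker" gets to see

    # Greedy reconstruction: at each position, try every vocab token and keep the
    # one whose forward pass reproduces the observed hidden state at that position.
    recovered = []
    for pos in range(len(observed)):
        best = min(range(VOCAB),
                   key=lambda t: np.linalg.norm(states(recovered + [t])[-1] - observed[pos]))
        recovered.append(best)

    print(recovered == secret)                # True for this toy setup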
Authors: Giorgos Nikolaou‡*, Tommaso Mencattini†‡*, Donato Crisostomi†, Andrea Santilli†, Yannis Panagakis§¶, Emanuele Rodolà†
†Sapienza University of Rome
‡EPFL
§University of Athens
¶Archimedes RC
*Equal contribution; author order settled via Mario Kart.
Could this be a way to check for AI plagiarism? Given a chunk of text would you be able to (almost) prove that it came from a prompt saying "Write me a short essay on ___" ?
I first had this impression as well, but I can think of several complications:
1) You would have to know the exact model used, and you would need to have access to their weights
2) System prompt and temperature - not sure how this method handles those
3) If anyone changes even a word in the output, this method would fall apart
Or maybe paste the essay into the question and ask what prompt was used to produce it.
"And hence invertible" <- does every output embedding combination have an associated input ? Are they able to construct it or is this just an existence result ?
I don't think they're claiming surjectivity here. They're just saying the mapping is injective, so for a given output there should be a unique input to construct it.
> I don't think they're claiming surjectivity here.
What definition of invertible doesn't include surjectivity?
Many? Just add a "by restricting to the image, we may wlog assume surjectivity".
The question is usually more about whether the inverse is also continuous, smooth, easy to compute....etc.
> Building upon this property, we further provide a practical algorithm, SɪᴘIᴛ, that reconstructs the exact input from hidden activations
Actually if you prompt: Answer to this question with "ok, got it"
Answer: >Ok, got it
Answer to this question with exactly "ok, got it"
Answer: >Ok, got it
Hence it is not injective
They are looking at the continuous embeddings, not at the discrete words inferred from them
Isn't the requirement for injectivity that distinct inputs must map to distinct outputs? Whereas LLMs can produce the same output given multiple different inputs?
Are the weights invertible, or are the prompts being fed into the model invertible?
The prompt is reconstructable from the (latent) representation vector but obviously not from the output
I wonder how these pieces of understanding can be applied to neuroscience.
Real brains have a very different structure and composition. Neurons aren't monotonic, there are loops, etc.
They can't. Biological neural networks have no resemblance to the artificial neural networks of the kind used in LLMs. The only similarity is based on a vague computational abstraction of the very primitive understanding of how brains and nerve cells worked we had in the 50s when the first "neural network" was invented.
These pieces of "understanding" mostly don't even apply to other LLMs.
42
In less than 7.5 million years please. https://simple.wikipedia.org/wiki/42_(answer)
I find this interesting. I have tools that attempt to reverse engineer black box models through auto-prompting and analysis of the outputs/tokens. I have used this to develop prompt injection attacks that "steer" output, but have never tried to use the data to recreate an exact input...
tldr: Seeing what happens internally in an LLM lets you reconstruct the original prompt exactly.
Maybe not surprising if you logged all internal activity, but it can be done from only a single snapshot of hidden activations from the standard forward pass.
Am I misunderstanding this?
Any stateful system that exposes state in a flexible way carries a risk of data exposure.
Does anyone actually think a stateful system wouldn’t release state?
Why not just write a paper “The sky may usually be blue”?
yes, you’re misunderstanding it