These watermarks are not robust to paraphrasing attacks: ROC AUC falls from 0.95 to 0.55 (barely better than guessing) for a 100-token passage.
The existing impossibility results imply that these attacks are essentially unavoidable (https://arxiv.org/abs/2311.04378) and not very costly, so this line of inquiry into LLM watermarking seems like a dead end.
I spent the last five years doing PhD research into steganography, with a particular focus on how to embed messages into LLM outputs. Watermarking is basically one-bit steganography.
The first serious investigations into "secure" steganography were about 30 years ago and it was clearly a dead end even back then. Sure, watermarking might be effective against lazy adversaries--college students, job applicants, etc.--but can be trivially defeated otherwise.
All this time I'd been lamenting my research area as unpopular and boring when I should've been submitting to Nature!
This article goes into it a little bit, but an interview with Scott Aaronson goes into some detail about how watermarking works[0].
He's a theoretical computer scientist, but he was recruited by OpenAI to work on AI safety. He has a very practical view on the matter and is focusing his efforts on leveraging the probabilistic nature of LLMs to provide an undetectable digital watermark. It nudges certain words to be paired together slightly more often than chance, and you can mathematically derive, with some level of certainty, whether an output (or even a section of an output) was generated by the LLM. It's really clever, and apparently he has a working prototype in development.
One workaround he hasn't figured out yet: asking for an output in language X and then translating it into language Y. But that may still be figured out eventually.
I think watermarking would be a big step forward to practical AI safety and ideally this method would be adopted by all major LLMs.
That part starts around 1 hour 25 min in.
> Scott Aaronson: Exactly. In fact, we have a pseudorandom function that maps the N-gram to, let’s say, a real number from zero to one. Let’s say we call that real number ri for each possible choice i of the next token. And then let’s say that GPT has told us that the ith token should be chosen with probability pi.
https://axrp.net/episode/2023/04/11/episode-20-reform-ai-ali...
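For anyone curious what that looks like concretely, here is a minimal sketch of the scheme as I read that quote (not OpenAI's actual implementation; the PRF, key handling, and detection threshold are all made up for illustration): a keyed pseudorandom function maps the current n-gram and each candidate token i to a value r_i in [0, 1), and instead of sampling from p_i directly you pick the token that maximizes r_i^(1/p_i). If the r_i were truly i.i.d. uniform this reproduces the model's distribution exactly, but the key holder can later test the text for the tell-tale correlation.

    import hashlib
    import numpy as np

    def prf(key: bytes, ngram: tuple, token_id: int) -> float:
        # Keyed pseudorandom function: (context n-gram, candidate token) -> r in [0, 1).
        digest = hashlib.sha256(key + repr((ngram, token_id)).encode()).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def watermarked_sample(probs: np.ndarray, ngram: tuple, key: bytes) -> int:
        # Pick argmax_i r_i^(1/p_i); with truly uniform r_i this selects token i
        # with probability p_i, so the output distribution is (nearly) unchanged.
        r = np.array([prf(key, ngram, i) for i in range(len(probs))])
        scores = np.where(probs > 0, r ** (1.0 / np.maximum(probs, 1e-12)), -1.0)
        return int(np.argmax(scores))

    def detection_score(token_ids: list, key: bytes, n: int = 4) -> float:
        # Average of -log(1 - r) over generated tokens. Unwatermarked text averages
        # about 1 per token; watermarked text scores noticeably higher.
        total = 0.0
        for t in range(n, len(token_ids)):
            r = prf(key, tuple(token_ids[t - n:t]), token_ids[t])
            total += -np.log(1.0 - r + 1e-12)
        return total / max(len(token_ids) - n, 1)

The appeal is that it only touches the sampling step; the flip side is that anything that disturbs the n-grams (paraphrasing, translation, re-tokenization) degrades the signal quickly.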
I don't think that provable watermarking is possible in practice. The method you mention is clever, but before it can work, you would need to know the output probabilities of every other source that could also have been used to generate the text for the same purpose. Only if you can show that this model assigns the output a much higher probability than any other source, including humans, does the watermark give a stronger indication.
You would also need to define how confidence scales with output length. The longer the output, the more certain you can be. What is the smallest number of tokens below which nothing can be proved at all?
You would also need to include humans. Can you define such a probability for a human? And all LLMs would have to use the same system uniformly.
Otherwise, watermarking is doomed to be misused and will never be reliable enough. False accusations will take place.
I agree. I'd add that not only could human-written content fail the test -- it's also the case that humans will detect the word pairing, just as they detected "delve" and various other LLM tells.
In time most forms of watermarking along those lines will seem like elements of an LLM's writing style, and will quickly be edited out by savvy users.
>So it nudges certain words to be paired together slightly more than random and you can mathematically derive with some level of certainty whether an output or even a section of an output was generated by the LLM.
hah, every single LLM already watermarks its output by starting the second paragraph with "It is important/essential to remember that..." followed by inane gibberish, no matter what question you ask.
I've always felt you'd be able to tell someone uses Reddit because they'll reply to a comment starting the sentence with "The problem is that..."
Now LLMs are trained on Reddit users.
Sounds like LLMs are trained on my posts because i tend to use both of those phrases :-P
Or just check whether text contains the word delve and it's most likely AI generated. I fucking hate that word now.
Sounds interesting, but it also sounds like something that could very well be circumvented by using a technique similar to speculative decoding: you use the censored model like you'd use the fast llm in speculative decoding, and you check whether the other model agrees with it or not. But instead of correcting the token every time both models disagree like you'd do with speculative decoding, you just need to change it often enough to mess with the watermark detection function (maybe you'd change every other mismatched token, or maybe one every 5 tokens would be enough to reduce the signal-to-noise ratio below the detection threshold).
You wouldn't even need to have access to an unwatermarked model; the “correcting model” could even be watermarked itself, as long as it's not the same watermarking function applied to both.
Or am I misunderstanding something?
No you've got it right. Watermarks like this are trivial to defeat, which means they are only effective against lazy users like cheating college students and job applicants.
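A rough sketch of the attack described above, under those assumptions (gpt2 and distilgpt2 are just stand-ins for the watermarked model and the "correcting" model; they share a tokenizer, which keeps the example simple, and no real watermark is involved):

    # Sketch only: greedy decoding, no sampling.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    primary = AutoModelForCausalLM.from_pretrained("gpt2")          # pretend: watermarked model
    corrector = AutoModelForCausalLM.from_pretrained("distilgpt2")  # pretend: differently/un-watermarked model

    def scrub_generate(prompt: str, max_new_tokens: int = 60, k: int = 5) -> str:
        ids = tok(prompt, return_tensors="pt").input_ids
        disagreements = 0
        for _ in range(max_new_tokens):
            with torch.no_grad():
                p_tok = int(primary(ids).logits[0, -1].argmax())
                c_tok = int(corrector(ids).logits[0, -1].argmax())
            # On every k-th disagreement, take the corrector's token instead of the
            # primary model's; this perturbs the statistical signature the detector
            # relies on without rewriting most of the text.
            if p_tok != c_tok:
                disagreements += 1
                next_tok = c_tok if disagreements % k == 0 else p_tok
            else:
                next_tok = p_tok
            ids = torch.cat([ids, torch.tensor([[next_tok]])], dim=-1)
        return tok.decode(ids[0], skip_special_tokens=True)

    print(scrub_generate("Watermarking LLM output is"))

How small k can be before the detector stops firing depends on how strong the watermark signal is, which is exactly the robustness-versus-quality trade-off being discussed.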
"An LLM generates text one token at a time. These tokens can represent a single character, word or part of a phrase. To create a sequence of coherent text, the model predicts the next most likely token to generate. These predictions are based on the preceding words and the probability scores assigned to each potential token.
For example, with the phrase “My favorite tropical fruits are __.” The LLM might start completing the sentence with the tokens “mango,” “lychee,” “papaya,” or “durian,” and each token is given a probability score. When there’s a range of different tokens to choose from, SynthID can adjust the probability score of each predicted token, in cases where it won’t compromise the quality, accuracy and creativity of the output.
This process is repeated throughout the generated text, so a single sentence might contain ten or more adjusted probability scores, and a page could contain hundreds. The final pattern of scores for both the model’s word choices combined with the adjusted probability scores are considered the watermark. This technique can be used for as few as three sentences. And as the text increases in length, SynthID’s robustness and accuracy increases."
Better link: https://deepmind.google/technologies/synthid/
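To make the quoted description concrete, here is a minimal sketch of the general "nudge the probability scores" idea. This is the simpler green-list style of watermark, not SynthID's actual tournament sampling (discussed further down); DELTA, GAMMA, and the hashing scheme are arbitrary choices for illustration.

    import hashlib
    import numpy as np

    DELTA = 2.0   # logit bonus for "green" tokens
    GAMMA = 0.5   # fraction of the vocabulary that is green at each step

    def green_mask(prev_token: int, vocab_size: int, key: bytes) -> np.ndarray:
        # A keyed hash of the previous token decides which tokens are "green" this step.
        seed = int.from_bytes(hashlib.sha256(key + int(prev_token).to_bytes(4, "big")).digest()[:8], "big")
        return np.random.default_rng(seed).random(vocab_size) < GAMMA

    def biased_sample(logits: np.ndarray, prev_token: int, key: bytes) -> int:
        # Nudge green tokens up slightly, then sample as usual.
        biased = logits + DELTA * green_mask(prev_token, len(logits), key)
        probs = np.exp(biased - biased.max())
        probs /= probs.sum()
        return int(np.random.default_rng().choice(len(probs), p=probs))

    def detect_z(token_ids: list, vocab_size: int, key: bytes) -> float:
        # z-score of the observed green-token count vs. what chance (GAMMA) predicts.
        hits = sum(green_mask(prev, vocab_size, key)[cur]
                   for prev, cur in zip(token_ids, token_ids[1:]))
        n = max(len(token_ids) - 1, 1)
        return (hits - GAMMA * n) / np.sqrt(n * GAMMA * (1 - GAMMA))

The "in cases where it won't compromise the quality" caveat in the quote matters: when the model is very confident (one token holds nearly all the probability mass), a small logit nudge changes nothing, so low-entropy passages carry almost no watermark signal.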
I'm fascinated that this approach works at all, but that said, I don't believe watermarking text will ever be practical. Yes, you can do an academic study where you have exactly 1 version of an LLM in exactly 1 parameter configuration, and you can have an algorithm that tweaks the logits of different tokens in a way that produces a recognizable pattern. But you should note that the pattern will be recognizable only when the LLM version is locked and the parameter configuration is locked. Which they won't be in the real world. You will have a bunch of different models, and people will use them with a bunch of different parameter combinations. If your "detector" has to be able to recognize AI-generated text from a variety of models and a variety of parameter combinations, it's no longer going to work. Even if you imagine someone brute-forcing all these different combos, the trouble is that some of the combos will produce false positives just because you tested so many of them. Want to get rid of those false positives? Go ahead, make the pattern stronger. And now you're visibly altering the generated text to an extent where that is a quality issue.
In summary, this will not work in practice. Ever.
In practice, every programmer or writer who takes LLM output does a lot of rewriting against already existing code or already existing text. Stitching together parts of many LLM outputs is the only way to use an LLM effectively, even stitching together parts from different LLMs, which I do all the time.
Recognizing only parts of a watermark, and many watermarked parts scattered all around doesn't seem possible at all, in my mind.
They can, however, develop software to sell very expensively to universities, schools, etc., and it will occasionally catch a particularly careless person who uses the LLM all the time, doesn't even try to improve the answer, and always hands in the LLM output in one piece.
At the end of the day, it will lead to so many false accusations that people will stop trusting it. In chess, false accusations of cheating have happened all the time for 15 years or more. Right now, former world chess champion Kramnik has accused over 50 top players of cheating in the span of two months, including five-time US champion Nakamura.
If a software like that gets applied to schools and universities, we are gonna have the fun of our lives.
Even with temperature = 0, LLMs are still non-deterministic, as their internal, massively parallelized calculations are done with floating point arithmetic, which is order-dependent. Running the same LLM with the exact same parameters multiple times might still yield slightly different probabilities in the output, making this watermarking scheme even less robust.
This isn't necessarily true, it just depends on the implementation. I can say that because I've published research which embeds steganographic text into the output of GPT-2 and we had to deal with this. Running everything locally was usually fine--the model was deterministic as long as you had the same initial conditions. The problems occurred when trying to run the model on different hardware.
That's not my experience unless LLM providers are caching results. It's frustratingly difficult to get it to output substantially different text for a given prompt. It's like internally it always follows mostly the same reasoning for step 1, then step 2 applies light fudging of the output to give the appearance of randomness, but the underlying structure is generally the same. That's why there's so much blog spam that all pretty much read the same, but while one "delves" into a topic another "dives" into it.
How long until they can write genuinely unique output without piles of additional prompting?
Couldn’t this be easily disrupted as a watermark system by simply changing the words to interfere with the relative checksum?
I suspect sentence structure is also being used, or more likely is the primary “watermark”. Similar to how you can easily identify that something is at least NOT a Yoda quote based on it having the wrong structure. Combine that with other negative patterns, like the quote containing Harry Potter references instead of Star Wars, and you can start to build up a profile of tells like these.
By rewriting the sentence structure and altering usual wording instead of directly copying the raw output, it seems like you could defeat any current raw watermarking.
Though this hasn’t stopped Google and others in the past from using bad science and statistics to make overconfident claims, like when they added CAPTCHA problems everybody said would be “literally impossible” for bots to solve.
What a surprise that they turned out to be trivial to automate, and that the data they produce can be sold for profit at the expense of massive amounts of consumer time.
In principle, it seems like you could have semantic watermarking. For instance, suppose I want a short story. There are lots of different narrative and semantic aspects of it that each carry some number of bits of information: setting, characters, events, and those lay on a probability distribution like anything else. You just subtly shift the probability distribution of those choices, and then it's resistant to word choice, reordering, and any transformation that maintains its semantic meaning.
Much simpler: make every sentence contain an even number of words. The chance of 10 sentences in a row all being even by accident is only about 0.1%.
Some of the watermarking is really obvious. If you write song lyrics in ChatGPT, watch for phrases like "come what may" and "I stand tall."
It's not just that they are (somewhat) unusual phrases, it's that ChatGPT comes up with those phrases so very often.
It's quite like how earlier versions always had a "However" in between explanations.
Coheed and Cambria were using ChatGPT this whole damn time, smh
ChatGPT does not have a watermark.
It has a rich tapestry of watermarks.
A delicate dance of them, perhaps
It's like a symphony of watermarks, all playing in harmony.
I suggest we "delve" deeper into this problem.
What makes you sure about that?
I think we just need to give up on this. What’s the harm? It’s not like some ground truth is fabricated.
I’m far, far more concerned about photo, video, and audio verification. We need a camera that can guarantee a recording is real.
I've been thinking about this for a while. Digital signatures can guarantee that a piece of data is authentic, if the author wishes to sign it.
Why do we need that for photo, video and audio? If it's about the general public believing something false, they're not going to check the watermarks of random internet content or trust anyone who says they checked it. If they really want to know, they can go to the source, and if they trust that person or organization, they can also trust the content it published. If it's about use in court, we already have a system for that: the person who recorded it appears in court as a witness and promises that they didn't alter it; if it turns out they did, they can go to prison.
Google is branding this in a positive light but this is just AI text DRM.
It's likely more about preventing model incest than digital rights management
Like all things a computer can or can't do, DRM isn't inherently bad: it's how it's used that's the problem.
I.e., DRM can't change people's motivations. It's useful for things like national security secrets and trade secrets, where the people who have access to the information have very clear motivations to protect that information, and very clear consequences for violating the rules the DRM is in place to protect.
In this case, the big question of whether AI watermarking will work or fail has more to do with people's motivations: will the general public accept AI watermarking because it fits our motivations and the consequences we set up for AI masquerading as a real person, or AI being used for misinformation? That's a big question that I can't answer.
This is not a “good deed for the public” done by Google, this is just a self serving tool to enforce their algorithms and digital property. There is nothing “bad” here for the public but it’s certainly not good either.
I for one am glad we might have a path forward to filtering out LLM-generated sludge.
> we
If by "we" you mean anyone else than Google and the select few other LLM provider they choose to associate with, I'm afraid you're going to be disappointed.
If there is a detectable fingerprint, we can detect it too. Probably don't even need a Bletchley Park.
> the team tested it on 20 million prompts given to Gemini. Half of those prompts were routed to the SynthID-Text system and got a watermarked response, while the other half got the standard Gemini response. Judging by the “thumbs up” and “thumbs down” feedback from users, the watermarked responses were just as satisfactory to users as the standard ones.
Three comments here:
1. I wonder how many of the 20M prompts got a thumbs up or down. I don't think people click that a lot. Unless the UI enforces it. I haven't used Gemini, so I might be unaware.
2. Judging a single response might not be enough to tell whether the watermarking is acceptable. For instance, imagine the watermarking is adding "However," to the start of each paragraph. In a single GPT interaction you might not notice it. Once you get 3 or 4 responses it might stand out.
3. Since when is Google happy with measuring by self-declared satisfaction? Aren't they the kings of A/B testing and high-volume analysis of usage behavior?
> I don't think people click that a lot.
I sometimes do, but I almost always give the wrong answer or the opposite answer where possible.
I suspect your account's feedback would be easily filtered out
but why? what for?
My timesheet SaaS constantly asks for feedback, which I give 0/10, as constantly asking for feedback really annoys me.
They then contact me and ask why; I tell them, and they say there is nothing they can do. A week later I'll get a pop-up asking for feedback and we go round the same loop again.
Because companies like Google are a cancer and I don't want to give them data they didn't pay for.
hm
reminds me of "what have the romans ever done for us?"
but thx for elaborating.
Google are obviously pushing this as a way to root out AI blog spam.
If they can just get other providers to use it, in the name of 'safety' or something, they won't have to change their indexer much. Otherwise PageRank is dead due to the ease of creating content farms.
The academic paper: https://www.nature.com/articles/s41586-024-08025-4
They use the last N prefix tokens, hash them (with a keyed hash), and use the random value to sample the next token via an 8-wise tournament: assign random bits to each of the top 8 preferred tokens, make pairwise comparisons, and keep the token with the larger bit. (Yes, it seems complicated, but apparently it increases the watermarking accuracy compared to straightforward nucleus sampling.)
The negative of this approach is that you need to rerun the LLM, so you must keep all versions of all LLMs that you trained, forever.
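For the curious, here is a minimal sketch of a tournament step as I read that description (an 8-candidate, one-bit-per-layer toy version; the paper's real construction is larger and more careful, and as the reply below points out, detection only needs the keyed hash and the tokenizer, not the model):

    import hashlib
    import numpy as np

    def g_bit(key: bytes, context: tuple, token_id: int, layer: int) -> int:
        # Keyed hash of (recent prefix tokens, candidate token, tournament layer) -> one pseudorandom bit.
        return hashlib.sha256(key + repr((context, token_id, layer)).encode()).digest()[0] & 1

    def tournament_sample(probs: np.ndarray, context: tuple, key: bytes, m: int = 8) -> int:
        # Draw m candidates from the model's distribution, then run pairwise
        # knockouts, keeping whichever candidate has the larger pseudorandom bit.
        rng = np.random.default_rng()
        candidates = list(rng.choice(len(probs), size=m, p=probs, replace=True))
        layer = 0
        while len(candidates) > 1:
            survivors = []
            for a, b in zip(candidates[0::2], candidates[1::2]):
                survivors.append(a if g_bit(key, context, a, layer) >= g_bit(key, context, b, layer) else b)
            candidates = survivors
            layer += 1
        return int(candidates[0])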
They actually run a 2^30-way tournament (they derive an equivalent form that doesn't require ~2 billion operations). You do not need to run the LLM; it only depends on the tokenizer.
Why do you need to rerun the LLM? Watermark detection only requires the hash functions (equation (1) from the paper).
This strikes me as potentially a bad thing for regular people. For example, corporations can still use AI filtering to force job seekers to jump through hoops, but job seekers won't be able to use AI to generate the cover letters and resumes that those hoops demand.
Who is going to pay for watermarked output?
Correct me if I’m wrong, but wouldn’t it simply drive people to use LLMs that are not watermarking their content?
I think your idea is basically right, but there are two points to consider:
- Your hypothesis only holds if the alternative LLM is also "sufficiently good". If Gemini does not stay competitive with other LLMs, Google's AI plans have a much more serious problem.
- Your hypothesis assumes that many people will be capable of detecting the watermarks (both of Gemini and other LLMs) so that they can make a conscious choice for another LLM. But the idea behind good watermarking is that it is not that easy to detect.
According to the article, you can just have another LLM summarise Gemini's watermarked output and that will "likely" defeat the watermark detection.
But, if all the good models can only be trained by large mega corps with close connections to the government, it's only a matter of time until that other LLM will just add its own watermark.
People use Google Search despite it being littered with adverts and tracking. Maybe Google are counting on either being better than the competition despite watermarking, or simply accepting that people who don't care are enough of a market that it's still worth adding.
Agreed, that's the obvious prediction. They're also going to perform worse on 3p benchmarks right?
If Google locks in enterprise clients using Google Workspace to Gemini then they won't really have a choice. It is selling it as an "add-on" already: https://workspace.google.com/solutions/ai/#plan
Suffice to say, it is evident that no other LLM will come close to Gemini's integration with Google Docs and the other Workspace apps.
Correct me if I'm wrong, but watermarking is only possible if the model has a limited set of inputs you can provide (which affect the output) and a limited set of outputs it produces, and it would have to be completely deterministic. And you would have to pre-calculate all possible combinations.
And this would also have to be the case for every possible LLM; then you could compare which LLMs could produce which outputs from which inputs. Only then is there some certainty that this output was produced by this LLM, or that another LLM might produce it as well given these inputs.
So... impossible?
Why does the user care if its watermarked? Surely there are only some use cases for this stuff where it matters. Most of the time isn't it just people having ephemeral chats where this wouldn't matter?
Using LLMs to write your essays and reports for school or uni, in a way that could get you punished if caught, is a reasonably big use case.
Agreed, it's probably a big use case in general, but token for token I bet it's relatively small! How many big papers do you have to write in a semester? Even if it's four, that's nothing compared to the everyday use you'll make of it.
I see no scenario where there won't be an LLM deliberately tailored for that purpose, possibly even built by an "intel" agency for the very purpose of having blackmail material on someone who may become useful later in their career.
AIs and LLMs have an extremely uphill PR battle to fight right now. Anything that is deemed AI-generated is assumed to be borderline trash (lots of exceptions, but you get the point). So I can see that if someone uses an LLM to generate text, they don't want it marked as "low-effort content".
There are definitely exceptions, and the fact that there are maybe proves it is less anti-AI prejudice at play and more just a reaction to things that are indeed trashy. It just so happens that a lot of it today is from AI, I think (for, I hope, obvious reasons).
Just to say, maybe give it a little time, but a watermark like this is not going to be the thing that decides someone's reaction in the near future; what the text says will. (I am just betting here.)
But it's going to be an uphill battle either way if you are really getting the model to write everything; I do not envy that kind of project.
People made this same argument about DRM escalations, about increasing privacy violations in the browser, and about Google's donations to support climate change misinformation. Even about Facebook interface redesigns. Every variation of "people will be driven to do X" I've ever heard assumes some coherence and unity of collective purpose that rarely matches the reality of how people behave.
There are counter examples, e.g. Unity. But catching that lightning in a bottle is rare and merits special explanation rather than being assumed.
Using LLMs for exams and homework has a different driver. Getting caught results in punishment, so using an alternative would be better. None of the aforementioned examples have a "stick" aspect when you stick with Google.
Do LLMs always pick the most probable next word? I would have thought this would lead to the same output for every input. How does this square with the randomness you get from prompting the same thing over and over?
It depends. If we use beam search we pick the most likely sequence of tokens rather than the most likely token at each point in time. This process is deterministic though.
We can also sample from the distribution, which introduces randomness. Basically, if word1 should be chosen 75% of the time and word2 25% of the time, it will do that.
The randomness you’re seeing can also be due to implementation details.
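A tiny sketch of the difference, with made-up numbers (temperature 0 collapses to greedy argmax; any positive temperature samples from the softmax distribution):

    import numpy as np

    def sample_next(logits: np.ndarray, temperature: float = 1.0, rng=None) -> int:
        rng = rng or np.random.default_rng()
        if temperature == 0.0:
            return int(np.argmax(logits))            # greedy: same token every time
        scaled = logits / temperature
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()
        return int(rng.choice(len(probs), p=probs))  # stochastic: ~75% word1, ~25% word2 below

    # Toy two-word vocabulary where word1 should be picked 75% of the time at temperature 1.
    logits = np.log(np.array([0.75, 0.25]))
    draws = [sample_next(logits) for _ in range(10_000)]
    print(np.bincount(draws, minlength=2) / len(draws))   # roughly [0.75, 0.25]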
https://community.openai.com/t/a-question-on-determinism/818...
There is at least a parameter called temperature, which decides how much randomness to include in the output.
Setting it to 0 doesn't get you perfectly deterministic output though, per https://medium.com/google-cloud/is-a-zero-temperature-determ... , as you don't have perfect control over which floating-point approximations are made in your operations.
The most typical reason argmax (temp 0) is non-deterministic is that your request is running batched with other people's requests. The number and size of these affect the matrix sizes and thus the tiling decisions. Then you get a different floating-point order and thus different results.
Nvidia gives some guarantees about deterministic results of their kernels, but that only applies when you have exactly the same input data, which is not the case with in-flight batching.
I think people are already doing that. I frequently hear people watermarking their speeches with phrases like "are we aligned on this?", or "let's circle back" and similar.
I can’t tell if this is satire but that’s just corp-speak. I imagine those people also occasionally suggest “touching base” and “taking this offline”.
The phrases usually mean something useful, if one knows the meaning, but it is amusing how much people seem to stick with the same ones, even across companies.
I am not sure whether it was satire. I personally don't like corp speak - it feels like people talking like that are not humans. I am not sure I would welcome our AI overlords speaking like this, either.
But I find the idea that people will subconsciously start copying AI speech patterns (perhaps as a signal of submission) amusing. I think it's gonna throw a wrench into the idea.
IMHO LLMs either should help us communicate more clearly and succinctly, or we can use them as tools for creativity ("rephrase this in 18th century English"). Watermarking speech sabotages both of these use cases.
I really want to be able to try Gemini without the AI watermark. IIRC they've used SynthID from the start and it makes me wonder if it's the source of all of Gemini's issues.
Obviously Google claims that it doesn't cause any issues but I'd think that OpenAI and other competitors would have something similar to SynthID if it didn't impact performance.
> IIRC they've used SynthID from the start
Is that not at odds with what's presented in the article here?
OT: The publication (Spectrum by IEEE) has some really good content.
It's starting to become a common destination for when I want to read about interesting things.
"I hope this message finds you well." --- busted!
How is this supposed to work? By inserting special unicode characters?
How can you watermark text?
I haven't read how Google is doing it, but one way it could be done is to nudge which tokens get sampled. For example, every other token could be forced to have an odd-numbered id (where each token is assigned an id from 0 to ~32,000, or however many the vocabulary has). Then, to detect the watermark, you just tokenize the text and check whether the pattern is there. A problem with this approach is that it harms accuracy and coherency: if you ask "What is 2+2" and the token "4" is token #102 but the model has to pick an odd-numbered token, it may respond with a wrong answer or yap on strangely due to its limited selection of tokens (like "The accurate answer to your mathematical query is the number Four").
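A toy version of that parity idea (definitely not how SynthID works, just to show the mechanics and why it hurts quality):

    import numpy as np

    def parity_constrained_sample(probs: np.ndarray, position: int, rng=None) -> int:
        # At even positions, only odd-numbered token ids are allowed; renormalize and sample.
        rng = rng or np.random.default_rng()
        if position % 2 == 0:
            allowed = (np.arange(len(probs)) % 2 == 1)
            masked = np.where(allowed, probs, 0.0)
            return int(rng.choice(len(probs), p=masked / masked.sum()))
        return int(rng.choice(len(probs), p=probs))

    def parity_score(token_ids: list) -> float:
        # Fraction of even positions holding an odd token id:
        # ~0.5 for ordinary text, ~1.0 if the constraint above was applied.
        evens = token_ids[0::2]
        return sum(t % 2 == 1 for t in evens) / max(len(evens), 1)

The quality cost is exactly the 2+2 example: whenever the single best token happens to fall on the "wrong" side of the constraint, the model is forced into a worse choice.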
You can insert known spelling errors, choose certain phrasings, and more. It doesn't have to be new characters added to the text. Government security services have done stuff like this for decades to weed out moles.
moles should know better than to utilize mountweazels! https://en.wikipedia.org/wiki/Fictitious_entry
We've been studying unintentional watermarks for years.
https://en.wikipedia.org/wiki/Stylometry
You do not even need extra characters (although they help). You can use spaces, missing punctuation, upper/lower case in particular places, using or omitting conjunctions, word substitution, common misspellings, transposed letters, etc. How many extra spaces or tabs can you add to the end of a paragraph? At the beginning? Between sentences? Inside them? Then you have an AI agent design the scheme and train another one to detect it.
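As a toy illustration of how little it takes, here is one of those channels (trailing spaces between sentences) carrying a few bits. The function names and the one-bit-per-gap encoding are made up, and it obviously breaks as soon as an editor or CMS normalizes whitespace:

    import re

    def embed_bits(sentences: list, bits: str) -> str:
        # Hide one bit in each gap between sentences: '1' -> two spaces, '0' -> one.
        # Assumes len(bits) <= len(sentences) - 1; leftover gaps default to one space.
        gaps = ["  " if b == "1" else " " for b in bits]
        gaps += [" "] * (len(sentences) - 1 - len(gaps))
        text = sentences[0]
        for gap, sentence in zip(gaps, sentences[1:]):
            text += gap + sentence
        return text

    def extract_bits(text: str) -> str:
        # Recover the payload from the whitespace run after each sentence-ending period.
        # (Toy assumption: periods only appear at sentence ends.)
        return "".join("1" if len(gap) >= 2 else "0" for gap in re.findall(r"\.( +)", text))

    msg = embed_bits(["I saw the report.", "It looks fine.", "Ship it."], "10")
    print(extract_bits(msg))   # -> "10"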
> SynthID-Text works by discreetly interfering in the generation process: It alters some of the words that a chatbot outputs to the user in a way that’s invisible to humans but clear to a SynthID detector. “Such modifications introduce a statistical signature into the generated text,” [...] “During the watermark detection phase, the signature can be measured to determine whether the text was indeed generated by the watermarked LLM.”
As stated in the article, it alters the probabilities that the network produces in a predictable way, so that a different (but still correct-sounding) word is picked. It subtly alters the wording from what it would have output normally, in such a way that you can detect it while it still sounds correct to the user.
There's an article from ieee that explains it:
https://spectrum.ieee.org/watermark#:~:text=How%20Google%E2%...
> It has also open-sourced the tool and made it available to developers and businesses, allowing them to use the tool to determine whether text outputs have come from their own large language models (LLMs), the AI systems that power chatbots. However, only Google and those developers currently have access to the detector that checks for the watermark.
These two sentences next to each other don't make much sense. Or are misleading.
Yeah. I know. Only the client is open source and it calls home.
Is there significant throttling to prevent us from training a classification model against it?
This is information-theoretically guaranteed to make LLM output worse.
My reasoning is simple: the only way to watermark text is to inject some relatively low-entropy signal into it, which can be detected later. This has to a) work for "all" output for some values of all, and b) have a low false positive rate on the detection side. The amount of signal involved cannot be subtle, for this reason.
That signal has a subtractive effect on the predictive-output signal. The entropy of the output is fixed by the entropy of natural language, so this is a zero-sum game: the watermark signal will remove fidelity from the predictive output.
This is impossible to avoid or fix.
you are correct if we suppose we are at a global optimum. however, consider this example:
i have two hands
i have 2 hands
these sentences communicate the same thing, but one could be a watermarked result. we can apply this kind of equivalent-meaning word/phrase change many times over and be confident something is watermarked while having avoided any semantic shifts.
You're not wrong, but natural language has a lot of stylistic "noise" which can be utilized as a subliminal channel without noticeably degrading the semantic signal.
I want AI to use just the right word when it’s writing for me. If it’s going to nerf itself to not choose the perfect word so it can be watermarked, then why would I use that product? I’ll go somewhere else. And if it does use just the right word, then how is that different from a great human writer?
There is the 'loser's litigation' method of getting all of your non-watermarked competitors banned, usually involving some combination of magical rights-removing brain-hacks like national security or 'the children'.