As far as I can tell, what they're proposing is:
Today, for each output token the LLM produces a probability for each possible output, then a 'sampler' makes a probability-weighted random choice. If the next-token probabilities are 90% for foo, 9% for bar and 1% for baz, the sampler draws a random number between 0 and 1: if it's below 0.9 it outputs foo, between 0.9 and 0.99 it outputs bar, and between 0.99 and 1 it outputs baz.
But what if, instead of using truly random numbers, you had a source of uniformly distributed numbers that was deterministic, based on some secret key?
Each candidate token would remain just as likely as it was before - there would still be a 90% chance of foo being chosen. So the output shouldn't degrade in quality.
And sure, some tokens will have 99.999% probability, and their selection doesn't tell you much. But in most real-world use there are multiple plausible wordings, so the choices carry information. Across a large enough sample of the output, you could detect whether the sampler was following your secret deterministic pattern.
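To make that concrete, here's a toy sketch of the keyed-sampler idea as I understand it (my own simplification with made-up function names, not the paper's actual Tournament sampling):

```python
import hashlib

def keyed_uniform(key: str, context: tuple) -> float:
    """Deterministic 'random' number in [0, 1), derived from a secret key
    plus the recent context instead of a true random draw."""
    material = key + "|" + "|".join(context)
    digest = hashlib.sha256(material.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def sample_token(probs: dict, key: str, context: tuple) -> str:
    """Probability-weighted choice driven by the keyed number rather than
    random.random(); each token keeps its original probability."""
    r = keyed_uniform(key, context)
    cumulative = 0.0
    for token, p in probs.items():
        cumulative += p
        if r < cumulative:
            return token
    return token  # guard against floating-point rounding at the top end

# Same 90/9/1 example as above.
probs = {"foo": 0.90, "bar": 0.09, "baz": 0.01}
print(sample_token(probs, key="secret-key", context=("over", "the", "lazy", "dog")))
```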
Of course the downside is you've got to check on exactly the same LLM, and only people with the secret key can perform the check. And it's only applicable to closed-source LLMs.
I'm also not quite sure if it works when you don't know the exact prompt - so maybe my understanding of the paper is all wrong?
My understanding from the "Watermark Detection" section is that it only requires the key and the output text in order to do the detection. In particular, it seems like the random seed used for each token is based only on the previous 4 tokens and the LLM-specific key, so for any output longer than 4 tokens you can start to get a signal.
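To illustrate the flavor of that key-plus-text-only detection (again, my own simplification with made-up names: the paper's real scheme uses Tournament sampling with g functions and a more careful statistic, so this is just the shape of the idea):

```python
import hashlib

def g_score(key: str, prev_tokens: tuple, token: str) -> float:
    """Keyed pseudo-random score in [0, 1) for a token given the 4 preceding
    tokens. The watermarking sampler nudges generation toward higher-scoring
    tokens; the detector only needs to recompute these scores."""
    material = key + "|" + "|".join(prev_tokens) + "|" + token
    digest = hashlib.sha256(material.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def detection_score(tokens: list, key: str, window: int = 4) -> float:
    """Mean keyed score over the text: unwatermarked text should hover
    around 0.5, watermarked text noticeably higher."""
    scores = [
        g_score(key, tuple(tokens[i - window:i]), tokens[i])
        for i in range(window, len(tokens))
    ]
    return sum(scores) / len(scores) if scores else 0.0

# No model and no prompt needed -- just the candidate text and the key.
tokens = "the quick brown fox jumps over the lazy dog".split()
print(detection_score(tokens, key="llm-provider-key"))
```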
I don't think the key actually needs to be secret, as it's not trying to cryptographically secure anything. So all closed-weights LLM providers could just publicly share the keys they use for watermarking, and then anybody could use them to check if a particular piece of text was generated by a particular LLM.
That being said, I think you are right about this only really being useful for closed-weights models. If you have the weights, you can just run the LLM with a standard sampler and it won't be watermarked.
Why would anyone ever use such a model? And then, given the significant reduction in users, why would any closed model service do this?
Seems like a cool theoretical trick that has little practical implication.
There are situations where the model output being watermarked doesn't matter. For instance, I hear people on HN asking LLMs to explain things to them all the time (which I think is a bad idea, but YMMV), and people use LLMs to write code quickly (which I think is at least possibly a good idea). There are also some content farms which churn out low-quality books on Amazon on various topics, and I don't think they care if they get caught using LLM outputs.
Thus it might reduce usage some, but it certainly wouldn't block all usage. Additionally, there are only a few providers of truly massive LLMs on the market right now. If they decided that doing this would be a social good, or more likely that it would bring bad PR to not do this when their competitors do, then they would at least be able to watermark all of the massive LLM outputs.
You say that as though there isn't a choice though. There will always be a good offering that doesn't watermark.
And there is no good reason for a provider to watermark - they aren't helping the customer. They'd be helping some other party who isn't paying them.
This will never be a thing.
> There will always be a good offering that doesn't watermark
There's a possible future where this gets legislated, right? Of course, there are lots of implementation challenges to this and it's probably a bad idea...
I'm totally happy having huge amounts of my use of LLMs identifiable as coming from an LLM. I don't see many important cases for me where I need to pretend it wasn't from an LLM.
I will happily lose those cases for increased performance, that's the thing I care about.
Are there normal cases where you picture this as an issue?
Not a problem for me. I am not a student anymore.
And I am not against LLM output being identifiable as such. (although I think an argument could be made based on the ruling about the monkey and the camera, which IIRC would say that the copyright belongs to whoever created the situation).
But after the
1. British Post Office scandal and
2. some really high profile cases of education institutions here in Norway abusing plagiarism detectors
I do not feel ready to trust either
1. complex software (and especially not closed-source software) to tell us who is cheating or not, or
2. any human's ability to use such a system in a sensible way.
While cheating cases don't usually end up in criminal court, students also usually don't get a free defense.
For this reason I suggest cheating should have to be proven to have occurred, not "suggested to probably have occurred" by the same people who create the not-very-reliable and extremely hard-to-reproduce LLMs.
Increased performance? Watermarking will not increase performance. They are talking about tilting the decoding process in minor ways. It won't help (or hurt much) performance.
I haven't been in school since LLMs became useful, but if I were to "cheat", I'd ask the LLM for a very fine-grained outline, and then just translate the outline into my own words. Then I'd ask the LLM to fill in my citations in AMA format.
And honestly, this still retains like 95% of the value of writing a paper, because I did write it, the words did flow through my brain. I just used the LLM to avoid facing a blank page.
I've also thought about asking LLMs to simulate a forum conversation about the Civil War (or whatever the topic may be), and include a wrong comment that can be countered by writing exactly what the assignment requires, because I seem to have no trouble writing an essay when duty calls and someone is wrong on the internet.
These things don’t have to be foolproof to be effective deterrents.
What happens if I tell the LLM to reword every other sentence? — or every 5th word?
I must be missing something, because this seems to assume a contiguous output.
It's possible that this might break the method, but what seems most likely to me is that the LLM will simply replace every 5th word with some other word that it is more likely to use because of the watermark sampling. Thus the resulting output would display roughly the same level of "watermarkedness".
You might be able to have one LLM output the original, and then another do a partial rewording, though. The resulting text would likely have higher-than-chance "watermarkedness" for both LLMs, but less than you would expect from a plain output. Perhaps this would be sufficient for short enough outputs?
What happens when we're all reading LLM output all the time, and simply start to adapt to LLM writing styles and word choices, possibly watermarking our own original writing without realizing it?
You might be right, but my first instinct is that this probably wouldn't happen enough to throw off the watermarking too badly.
The most likely word is chosen based on the previous four, and the watermark only works if there is enough entropy present that any of several words would do. Thus it's not a simple matter of humans picking up particular word choices. There might be some cases where there are 3 tokens in a row that occur with low entropy after the first token, and then one token generation with high entropy at the end; that would cause a particular five-word phrase to recur. Otherwise, the word choice would appear pretty random. I don't think humans pick up on stuff like that even subconsciously, but I could be wrong.
I would be interested to see if LLMs pick up the watermarks when fed watermarked training data, though. Evidently ChatGPT can decode base64 [0], so it seems like these things can pick up on some pretty subtle patterns.
[0] https://www.reddit.com/r/ChatGPT/comments/1645n6i/i_noticed_...
I have a question for all the LLM and LLM-detection researchers out there. Wikipedia says that the Turing test "is a test of a machine's ability to exhibit intelligent behaviour equivalent to, or indistinguishable from, that of a human."
Three things seem to be in conflict here:
1. This definition of intelligence...i.e. "behavior indistinguishable from a human"
2. The idea that LLMs are artificial intelligence
3. The idea that we can detect if something is generated by an LLM
This feels to me like one of those trilemmas, where only two of the three can be true. Or, if we take #1 as an axiom, then it seems like the extent to which we can detect when things are generated by an LLM would imply that the LLM is not a "true" artificial intelligence. Can anyone deeply familiar with the space comment on my reasoning here? I'm particularly interested in thoughts from people actually working on LLM detection. Do you think that LLM-detection is technically feasible? If so, do you think that implies that they're not "true" AI (for whatever definition of "true" you think makes sense)?
The original Turing test started by imagining you're trying to work out which of two people is a man or woman based on their responses to questions alone.
But suppose you ran that test where one of the hidden people is a confederate who steganographically embeds a gender marker without it being obvious to anyone but yourself. You would be able to break the game, even if your confederate was perfectly mimicking the other gender.
That is to say, embedding a secret recognition code into a stream of responses works on humans, too, so it doesn't say anything about computer intelligence.
And for that matter, passing the Turing test is supposed to be sufficient for proving that something is intelligent, not necessary. You could imagine all sorts of deeply inhuman but intelligent systems that completely fail the Turing test. In Blade Runner, we aren't supposed to conclude that failing the Voight-Kampff test makes the androids mindless automatons, even if that's what humans in the movie think.
I think measuring intelligence in isolation is misguided; it should always be measured in context, both the social context and the problem context. This removes a lot of mystique and unfortunately doesn't make for heated debates.
In its essentialist form it's impossible to define, but in context it is nothing but skilled search for solutions. And because most problems are more than any one person can handle, it's a social process.
Can you measure the value of a word in isolation from language? In the same way you can't meaningfully measure intelligence in a vacuum. You get a very narrow representation of it.
> 3. The idea that we can detect if something is generated by an LLM
The idea behind watermarking (the topic of the paper) is that the output of the LLM is specially marked in some way at the time of generation, by the LLM service. Afterwards, any text can be checked for the presence of the watermark. In this case, "detect if something is generated by an LLM" means checking for the presence of the watermark. This all works if the watermark is robust.
The idea that an LLM can pass itself off as a human author to all reviewers is demonstrably false:
https://www.youtube.com/watch?v=zB_OApdxcno
What they didn't put in the limitations or other sections (unless I missed it) is that it can only apply to longer creative text, not to structured or repeated output. For example, if you want to watermark generated code, you can't produce it as a diff to the existing file - the sampling changes will cause unwanted modifications.
Similarly, something like "fix the grammar in this long text" will have to tweak random words for no reason, because the existing text can't be 100% reproduced while injecting SynthID.
This is discussed in the "Watermarking with Synth-ID Text" section right after they define the Score function:
> There are two primary factors that affect the detection performance of the scoring function. The first is the length of the text x: longer texts contain more watermarking evidence, and so we have more statistical certainty when making a decision. The second is the amount of entropy in the LLM distribution when it generates the watermarked text x. For example, if the LLM distribution is very low entropy, meaning it almost always returns the exact same response to the given prompt, then Tournament sampling cannot choose tokens that score more highly under the g functions. In short, like other generative watermarks, Tournament sampling performs better when there is more entropy in the LLM distribution, and is less effective when there is less entropy.
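A toy simulation of that entropy effect, using the same kind of keyed per-token score as the sketch above (my own illustration; the two-candidate "keep the higher score" step is only a crude stand-in for the paper's Tournament sampling):

```python
import hashlib
import random

def g_score(key, prev_tokens, token):
    """Keyed pseudo-random score in [0, 1) for a token given its context."""
    material = key + "|" + "|".join(prev_tokens) + "|" + token
    digest = hashlib.sha256(material.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def generate(vocab_probs, length, key, watermark):
    """Toy generator: draw two candidates from the distribution and, if
    watermarking, keep the one with the higher keyed score."""
    vocab = list(vocab_probs)
    weights = list(vocab_probs.values())
    tokens = ["<s>"] * 4
    for _ in range(length):
        ctx = tuple(tokens[-4:])
        a, b = random.choices(vocab, weights=weights, k=2)
        if watermark and g_score(key, ctx, b) > g_score(key, ctx, a):
            a = b
        tokens.append(a)
    return tokens

def detection_score(tokens, key):
    """Mean keyed score; ~0.5 for unwatermarked text, higher when watermarked."""
    scores = [g_score(key, tuple(tokens[i - 4:i]), tokens[i])
              for i in range(4, len(tokens))]
    return sum(scores) / len(scores)

key = "demo-key"
high_entropy = {f"w{i}": 1 / 50 for i in range(50)}  # many equally plausible tokens
low_entropy = {"w0": 0.98, **{f"w{i}": 0.02 / 49 for i in range(1, 50)}}  # near-deterministic

for name, dist in [("high entropy", high_entropy), ("low entropy", low_entropy)]:
    marked = detection_score(generate(dist, 500, key, watermark=True), key)
    plain = detection_score(generate(dist, 500, key, watermark=False), key)
    print(f"{name}: watermarked ~{marked:.2f}, unwatermarked ~{plain:.2f}")
```

With many plausible tokens the watermarked score pulls clearly above 0.5; when the distribution is near-deterministic the candidates are usually identical and the score stays close to chance, which is exactly the limitation the quoted passage describes.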
Worth pointing out that while watermarking is mathematically reliable, the scammers who are selling "AI detection" don't have the weight-level access that it requires.
> watermarking is mathematically reliable
So should accounting be. To a much higher degree.
Yet I hope most of us are aware of the British Post Office scandal, where what should have been boring accounting software falsely accused thousands of employees of theft, over 900 of whom were convicted of theft, fraud and false accounting.
If this can happen in something as utterly boring as an accounting system should be in this millennium, I don't think we should trust AI fraud detection in science and academia until we get a few decades of experience with it.
(Do I think records from accounting can be used as evidence? Absolutely, given the right circumstances: we can know it hasn't been tampered with, etc.
What I don't think, however, is that pattern matching or "watermarks" that indicate probability should be used as evidence. Especially not closed-source systems with secret distributions and watermarking algorithms.)
I agree with you completely.
In the wild, there are too many variables to use watermarking to draw meaningful conclusions about any piece of text, no matter the word count. Scott Aaronson described well one of those variables, "the pineapple attack," in his Aug. 2023 talk at Simons [1].
Watermarking is illuminating to the ongoing study of language models' functionality, but it doesn't put the genie back in the bottle.
1. https://www.youtube.com/watch?v=2Kx9jbSMZqA
After skimming through the paper I can’t immediately pick out the data showing how much certainty there is in detecting a watermark for a given text, or a graph of that certainty as the text size grows. (They seem to assert that the certainty grows as token count goes up, but it’s not clear by how much.)
I worry (and have already read worrying things) about “cheating detection” tools that have been deployed in schools. My intuition would be that there’s just too much entropy between something like an essay prompt and the essay itself. I guess it also depends on how specific the teacher’s essay prompt is as well.
> I worry (and have already read worrying things) about “cheating detection” tools that have been deployed in schools.
This is my worry as well.
Punishment for cheating can easily set a student back a year or more. This is fair if the student has actually been cheating, but it is really harsh.
So while this isn't criminal court, I think schools should apply the same principles here: innocent until proven guilty.
And in my view, secret probability distributions aren't exactly good proof.
Furthermore, to make it even worse: if someone is actually innocent, it will be next to impossible to argue their innocence, since everyone will trust the system, and as far as I can see the system cannot actually be verified by a board without disclosing the weights. And that is assuming they would care to try to help a student prove their innocence in the first place.
AFAIK this is a topic that has been explored to some depth in science fiction, but more importantly, we have cases like the mail service in the UK where multiple people lost their jobs because nobody could believe the system they had built or paid for could make such crazy mistakes.
Back to students: for a less privileged student I guess it can easily ruin their studies. TBH, as someone who struggled a lot in school, I am not sure I'd have finished if my studies had been delayed by a year. Which would have been sad, given how well I have managed once I didn't have to juggle full-time studies and part-time work.
Recently (last year and this) we (Norway) have had some debates that seemed way overdue regarding what can be considered cheating (with some ridiculous examples of students getting punished for "self-plagiarism" for the most absurd things, including not citing a failed previous exam they had written themselves as a source).
This could easily have gone nowhere except for the fact that:
1. the person in charge of the board of appeals was caught for something else, and
2. somebody took the effort to dig out the master's theses of two ministers, including the then-sitting Minister of Education, and proved that they had clearly been "cheating" according to the rules they were judging students by.
It's easy to think of non-secure watermark methods to mark LLM-generated text for lazy students or lazy copywriters: occasional incorrect capitalization, etc.
One I thought about was zero-width spaces. If you add a sequence of them, lazy copiers will paste them, and you'll be able to test text with almost no computational overhead!
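Something like this, for example (a hypothetical sketch; it's trivially defeated by anything that strips or normalizes invisible characters):

```python
ZWSP = "\u200b"  # zero-width space: invisible in most renderers

def mark(text: str, every: int = 5) -> str:
    """Append a zero-width space to every Nth word."""
    words = text.split(" ")
    return " ".join(
        word + ZWSP if (i + 1) % every == 0 else word
        for i, word in enumerate(words)
    )

def looks_marked(text: str) -> bool:
    """Detection is just a substring check -- essentially free."""
    return ZWSP in text

marked = mark("the quick brown fox jumps over the lazy dog every single day")
print(looks_marked(marked), looks_marked("plain human-written text"))
```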
Several commenters who have not read the abstract of the paper are mentioning LLM-detection tools. That is not what is being shown here.
Rather, they are showing how to modify the design of an LLM to deliberately inject watermarks into generated text, such that it is possible to detect that the text came from a particular LLM.
While interesting in the abstract, I think I can definitively say that absolutely nobody wants this. People trying to pass off LLM content (whether students or content providers) as human-written are not interested in being detected. People who are using LLMs to get information for their own knowledge or amusement or as a cybernetic augmentation do not need this. LLM providers want to drive adoption, and if you can be exposed as passing off LLM slop as your own, then nobody will use their stuff.
Add prompt to ChatGPT
Get answer.
Rewrite in your own words.
Feed back to ChatGPT to check for errors.
Done. Watermarking really doesn’t solve any problem a clever person can’t trivially circumvent.
Or easier: Use Llama or some other open weights model with a non-watermarking sampler.
Well, spammers will probably skip the manual "Rewrite in your own words" step.
So it’s still useful in reducing spam.
Until it gets blocked as spam, and then they will get a watermark-stripping agent and bam, they're back in business.
They will use Google Translate to translate to Chinese and back, and feed it back into ChatGPT to fix the grammar mistakes that yields :D (sorry, you're right, it's still useful, but spammers be spammers!)
Why is this in nature.com?
Serious question: has that become pay-to-publish a la Forbes etc when I wasn't paying attention?