Everything talks about settlement to the 'authors'; is that meant to be shorthand for copyright holders? Because there are a lot of academic works in that library where the publisher holds exclusive copyright and the author holds nothing.
By extension, if the big publishers are getting $3000 per article, that could be a fairly significant windfall.
To be very clear on this point - this is not related to model training.
It’s important in the fair use assessment to understand that the training itself is fair use, but the pirating of the books is the issue at hand here, and is what Anthropic “whoopsied” into in acquiring the training data.
Buying used copies of books, scanning them, and training on it is fine.
> Buying used copies of books, scanning them, and training on it is fine.
But nobody was ever going to do that, not when there are billions in VC dollars at stake for whoever moves fastest. Everybody will simply risk the fine, which tends not to be anywhere close to large enough to have a deterrent effect in the future.
That is like saying Uber would have not had any problems if they just entered into a licensing contract with taxi medallion holders. It was faster to just put unlicensed taxis on the streets and use investor money to pay fines and lobby for favorable legislation. In the same way, it was faster for Anthropic to load up their models with un-DRM'd PDFs and ePUBs from wherever instead of licensing them publisher by publisher.
Anthropic literally did exactly this to train its models, according to the lawsuit. The lawsuit found that Anthropic didn't even use the pirated books to train its model. So there is that.
The lawsuit didn't find anything, Anthropic claimed this as part of the settlement. Companies settle without admission of wrongdoing all the time, to the extent that it can be bargained for.
If I'm reading this right yes the training was fair use, but I was responding to the claim that the pirated books weren't used to train commercially released LLMs. The judge complained that it wasn't clear what was actually used, from the June order https://fingfx.thomsonreuters.com/gfx/legaldocs/jnvwbgqlzpw/... [pdf]:
> Notably, in its motion, Anthropic argues that pirating initial copies of Authors’ books and millions of other books was justified because all those copies were at least reasonably necessary for training LLMs — and yet Anthropic has resisted putting into the record what copies or even sets of copies were in fact used for training LLMs.
> We know that Anthropic has more information about what it in fact copied for training LLMs (or not). Anthropic earlier produced a spreadsheet that showed the composition of various data mixes used for training various LLMs — yet it clawed back that spreadsheet in April. A discovery dispute regarding that spreadsheet remains pending.
The same could be said of grand larceny. The difference would seem to be a mix of social norms and, more notably for this conversation, very different consequences.
I think the most notable difference is that grand larceny actually deprived someone of something they would have otherwise had, while pirating something you couldn't afford to buy doesn't, because there was no circumstance in which they were getting the money, and piracy doesn't involve taking anything from them...
To an investor, that just looks like a pretty good deal, I reckon. It's just the cost of doing business - which in my opinion is exactly what is wrong with practices like these.
> which in my opinion is exactly what is wrong with practices like these.
What's actually wrong with this?
They paid $1.5B for a bunch of pirated books. Seems like a fair price to me, but what do I know.
The settlement should reflect society's belief of the cost or deterrent, I'm not sure which (maybe both).
This might be controversial, but I think a free society needs to let people break the rules if they are willing to pay the cost. Imagine if you couldn't speed in a car. Imagine if you couldn't choose to be jailed for nonviolent protest.
This isn't some case where they destroyed a billion dollars worth of pristine wilderness and got off with a slap on the wrist.
I'm still sore that I got punished repeatedly for not doing my homework but they still insisted that I do the homework. "I'll mark it zero! You'll fail!" "OK." "Fucking do the work you little shit"
I agree to some extent, but there is a slippery slope to “no rules apply to the rich”.
I do agree that in the case of victimless crimes, having some ability to compensate for damages instead of outright banning the thing means that we can enable many massively net-positive scenarios.
Of course, most crimes aren’t victimless and that’s where the negative reactions are coming from (eg company pollutes the commons to extract a profit).
> I think a free society needs to let people break the rules if they are willing to pay the cost
so you don't think super rich people should be bound by laws at all?
Unless you made the cost proportional to (maybe exponential in) somebody's wealth, you would be creating a completely lawless class who would wreak havoc on society.
What you describe is in fact what Waymo has had to, or chosen to, deal with. They didn't go for an end run around regulations related to vehicles on public roads. They committed to driverless vehicles and worked with local governments to roll it out as quickly as regulators were willing to allow.
Uber could have made the same decision and worked with regulators to be allowed into markets one at a time. It was an intentional choice to lean on the fact that Uber drivers blended into traffic and could hide in plain sight until Uber had enough market share and customer base to give them leverage.
I got to meet him in person and tell him that his books (along with The Coming Technological Singularity) had a huge influence on my decision to go into ML. He seemed pleased. I just wish he had wrapped up the Fire Upon the Deep series.
To be even more clear - this is a settlement, it does not establish precedent, nor admit wrongdoing. This does not establish that training is fair use, nor that scanning books is fine. That's somebody else's battle.
I suspect that ruling legally gets wiped off the books by the settlement since the case gets dismissed, no?
Even if the ruling legally remains in place after the settlement, district court rulings are at most persuasive precedent and not binding precedent in future cases, even ones handled by the same court. In the US federal court system, only appellate rulings at either the circuit court of appeals level or the Supreme Court level are binding precedent within their respective jurisdictions.
The ruling also doesn’t establish precedent, because it is a trial court ruling, which is never binding precedent, and under normal circumstances can’t even be cited as persuasive precedent, and the settlement ensures there will be no appellate ruling.
Which is very important for e.g. the NYT lawsuit against OpenAI. Basically there’s now precedent that training AI models on text and them producing output is not copyright infringement.
Everyone has more than a right to freely read everything that is stored in a library.
(Edit: in fact initially I wrote 'is supposed to' in place of 'has more than a right to' - meaning that "knowledge is there, we made it available: you are supposed to access it, with the fullest encouragement").
By US law, according to Authors Guild v. Google[1] on the Google book scanning project, scanning books for indexes is fair use.
Additionally:
> Every human has the right to read those books.
Since when?
I strongly disagree - knowledge should be free.
I don't think the author's arrangement of the words should be free to reproduce (ie, I think some degree of copyright protection is ethical) but if I want to use a tool to help me understand the knowledge in a book then I should be able to.
I think he implies that because one can hypothetically borrow any book for free from a library, one could use them for legal training purposes, so the requirement of having your own copy should be moot.
Libraries aren’t just anarchist free-for-alls; they are operating under licensing terms. Google had a big squabble with the University of Illinois Urbana-Champaign research library before finally getting permission to scan the books there. Guess what: Google has the full text, but books.google.com only shows previews; why is literally left as an exercise for the reader.
Libraries are neither anarchist free-for-alls nor are they operating under licensing terms with regard to physical books.
They're merely doing what anyone is allowed to do with the books that they own: loaning them out. Copyright law doesn't prohibit that, so no license is needed.
Yup. And if Anthropic CEO or whoever wants to drive down to the library and check out 30 books (or whatever the limit is), scan them, and then return them that is their prerogative I guess.
There are no terms and conditions attached to library books beyond copyright law (which says nothing about scanning) and the general premise of being a library (return the book in good condition on time or pay).
Copyright law in the USA may be more liberal about scanning than other jurisdictions (see the parallel comment from gpm), which expressly regulate the amount of copying of material you do not own as an item.
The jurisdictions I'm familiar with all give vague fair use/fair dealing exceptions which would cover some but not all copying (including scanning) with less than clear boundaries.
I'd be interested to know if you knew of one with bright line rules delineating what is and isn't allowed.
I knew about library genesis by 2012. It was at least 10 TiB large by then, IIRC. With the amount of Russian language content I got the impression it was more popular in that sphere, but an impressive collection for anyone and not especially secret.
It's in the megathread linked in this comment, but I want to specifically point to https://open-slum.org/ which is basically a status page for different sites dedicated to this purpose, and which I've found helpful.
He got into trouble for breaking into an unsecured network closet at MIT and using MIT credentials to download a bunch of copyrighted content.
The whole incident is written up in detail, https://swartz-report.mit.edu/ by Hal Abelson (who wrote SICP among other things). It is a well-researched document.
I think the parent may be getting at why he was downloading the content. I don't know the answer to this. Maybe someone here does. What was he intending to do with the articles?
The report speculates to his motivations on page 31, but it seems to be unknown with any certainty.
Swartz, like many of us, saw pay-for-access journals as an affront. I believe he wanted to "liberate" the content of these articles so that more people could read them.
Information may want to be free, but sometimes it takes a revolutionary to liberate it.
Anthropic legally purchased the books it used to train its model according to the judge. And the judge said that was fine. Anthropic also downloaded books from a pirate site and the judge said that was bad -- even though the judge also said they didn't use those books for training....
books.google.com was deemed fair use because it only shows previews, not full downloads. Internet Archive is still under litigation iirc; besides having owned a physical copy of every book they ever scanned (and keeping a copy in their warehouses), they let people read the whole thing.
I’m surprised Google hasn’t hit its competitors harder with the fact that they actually got permission to scan books from its partner libraries while Facebook and OpenAI just torrented books2/books3, but I guess they have an aligned incentive to benefit from a legal framework that doesn’t look too closely at how you went about collecting source material.
This is excellent news because it means that folks who pay for printed books and scan them also can train with their content. It's been said already that we've already trained on "the entire (public) internet." Printed books still hold a wealth of knowledge that could be useful in training models. And cheap, otherwise unwanted copies make great fodder for "destructive" scanning where you cut the spine off and feed it to a page scanner. There are online services that offer just that.
Yes, but the cat is out of the bag now. Welcome to the era of every piece of creative work coming with an EULA that you cannot train on it. It will be like clearing samples.
I feel like proportionality is related also to the scale. If a student pirates a textbook, I’d agree that 100x is excessive, but this is a corporation handsomely profiting off of mass piracy.
It’s crazy to imagine, but there was surely a document or Slack message thread discussing where to get thousands of books, and they just decided to pirate them and that was OK. This was entirely a decision based on ease or cost, not based on the assumption it was legal. Piracy can result in jail time IIRC, so honestly it’s lucky the employee who suggested this, or took the action, avoided direct legal liability.
Oh and I’m pretty sure other companies (meta) are in litigation over this issue, and the publishers knew that settlement below the full legal limit would limit future revenue.
Investment is debt lol. Maybe you can make the argument that you're increasing the equity value but you do have to eventually prove you're able to make money right? Maybe you don't, this system is pretty messed up after all.
Not if 100 companies did it and they all got away.
This is to teach a lesson because you cannot prosecute all thieves.
Yale Law Journal actually writes about this, the goal is to deter crime because in most cases damages cannot be recovered or the criminal will never be caught in the first place.
If in most cases damages cannot be recovered or the criminal will never be caught in the first place, then what is the lesson being taught? Doesn't that just create a moral hazard where you "randomly" choose who to penalize?
As long as they haven't been bullied into the corporate equivalent of suicide by the "justice" system, it's not disproportionate considering what happened to Aaron Swartz.
Well it's willful infringement so a court would be entitled to add a punitive multiplier anyway. But this is something Anthropic agreed to, if that wasn't clear.
In this specific case the settlement caps the lawyer fees at 25%, and even that is subject to the court's approval. In addition they will ask for $250k total ($50k / plaintiff) for the lead plaintiffs, also subject to the court's approval.
It is related to scalable model training, however. Chopping the spine off books and putting the pages in an automated scanner is not scalable. And don't forget about the cost of 1) finding 2) purchasing 3) processing and 4) recycling that volume of books.
> Chopping the spine off books and putting the pages in an automated scanner is not scalable.
That's how Google Books, the Internet Archive, and Amazon (their book preview feature) operated before ebooks were common. It's not scalable-in-a-garage but perfectly scalable for a commercial operation.
We hem and haw about metaphorical "book burning" so much we forget that books themselves are not actually precious.
The books that are destroyed in scanning are a small minority compared to the millions discarded by libraries every year for simply being too old or unpopular.
>we forget that books themselves are not actually precious.
Book burnings are symbolic (unless you're in the world of Fahrenheit 451). The real power comes from the political threat, not the fact that paper with words on it is now unreadable.
Well, the famous 1933-05-10 book burning did destroy the only copies of a lot of LGBT medical research, and destroying the last copy of various works was a stated intent of Nazi book burnings.
> It’s important in the fair use assessment to understand that the training itself is fair use,
I think that this is a distinction many people miss.
If you take all the works of Shakespeare and reduce them to tokens and vectors, is it Shakespeare or is it factual information about Shakespeare? It is the latter, and as much as organizations like MLB might want to be able to copyright a fact, you simply cannot do that.
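To make the tokens-and-vectors point concrete, here is a toy Python sketch (purely illustrative; real tokenizers like BPE are more involved, but the upshot is the same: the model consumes numbers, not prose):

    # Toy word-level "tokenization" -- illustrative only, not any lab's real pipeline.
    text = "Shall I compare thee to a summer's day?"
    vocab = {word: i for i, word in enumerate(sorted(set(text.lower().split())))}
    token_ids = [vocab[w] for w in text.lower().split()]
    print(token_ids)  # [4, 3, 1, 6, 7, 0, 5, 2] -- integer IDs, not Shakespeare's prose

Those IDs (and the dense vectors they map to) are what training actually operates on, which is what the "facts about Shakespeare" framing is getting at.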
Take this one step further. If you buy the work and vectorize it, that's fine. But if you feed in the vectors for Harry Potter so many times that the model can reproduce half of the book, it becomes a problem when it spits out that copy.
And what about all the other stuff that LLMs spit out? Who owns that? Well, at present, no one. If you train a monkey or an elephant to paint, you can't copyright that work because they aren't human, and neither is an LLM.
If you use an LLM to generate your code at work, can you leave with that code when you quit? Does GPL3 or something like the Elastic Search license even apply if there is no copyright?
I suspect we're going to be talking about court cases a lot for the next few years.
The question is going to be how much human intellectual input there was, I think. I don't think it will take much - you can write the crappiest novel on earth that is complete random drivel and you still have copyright on it.
So to me, if you are doing literally any human review, edits, control over the AI then I think you'll retain copyright. There may be a risk that if somebody can show that they could produce exactly the same thing from a generic prompt with no interaction then you may be in trouble, but let's face it should you have copyright at that point?
This is, however, why I favor stopping slightly short of full agentic development at this point. I want the human watching each step and an audit trail of the human interaction in doing it. Sure I might only get to 5x development speed instead of 10x or 20x but that is already such an enormous step up from where we were a year ago that I am quite OK with that for now.
Yes. Someone on this post mentioned that Switzerland allows downloading copyrighted material but not distributing it.
So things get even darker, because what counts as distribution can have a really vague definition, and maybe the AI companies will only follow the law just barely, just for the sake of not getting hit with a lawsuit like this again. But I wonder if all this case did was compensate the authors this one time. I doubt we will see a meaningful change in AI companies' attitudes towards fair use / essentially exploiting authors.
I feel like they would try to use as much legalspeak as possible to extract as much from authors (legally) without compensating them, which I feel is unethical, but sadly the law doesn't work on ethics.
Switzerland has five main collecting societies: ProLitteris for literature and visual arts, the SSA (Société Suisse des Auteurs) for dramatic works, the SUISA for music, Suissimage for audiovisual works, and SWISSPERFORM for related rights like those of performers and broadcasters. These non-profit societies manage copyright and related rights on behalf of their members, collecting and distributing royalties from users of their works.
Note that the law specifically regulates software differently, so what you cannot do is just willy nilly pirate games and software.
What distribution means in this case is defined in the swiss law. However swiss law as a whole is in some ways vague, to leave a lot up to interpretation by the judiciary.
I would assume it would compensate the publisher. Authors often hand ownership to the publisher; there would be obvious exceptions for authors who do well.
> And what about all the other stuff that LLMs spit out? Who owns that? Well, at present, no one. If you train a monkey or an elephant to paint, you can't copyright that work because they aren't human, and neither is an LLM.
This seems too cute by half, courts are generally far more common sense than that in applying the law.
This is like saying using `rails generate model Example` results in a bunch of code that isn't yours, because the tool generated it according to your specifications.
The example is a real legal case afaik, or perhaps paraphrased from one (don’t think it was a monkey - an ape? An elephant?).
I’d guess the legal scenario for `rails generate` is that you have a license to the template code (by way of how the tool is licensed) and the template code was written by a human so licensable by them and then minimally modified by the tool.
I think you're thinking of this case [1]; it was a monkey, and it wasn't a painting but a selfie. A painting would have only made the no-copyright argument stronger.
I don’t think the code you get from rails generate is yours. Certainly not by way of copyright, which protects original works of authorship and so if it’s not original, it’s not copyrightable, and yes it’s been decided in US courts that non-human-authorship doesn’t count as creative.
> courts are generally far more common sense than that in applying the law.
'The Board’s decision was later upheld by the U.S. District Court for the District of Columbia, which rejected the applicant’s contention that the AI system itself should be acknowledged as the author, with any copyrights vesting in the AI’s owner. The court further held that the CO did not act arbitrarily or capriciously in denying the application, reiterating the requirement that copyright law requires human authorship and that copyright protection does not extend to works “generated by new forms of technology operating absent any guiding human hand, as plaintiff urges here.”' From: https://www.whitefordlaw.com/news-events/client-alert-can-wo...
The court is using common sense when it comes to the law. It is very explicit and always has been... That word "human" has some long standing sticky legal meaning (as opposed to things that were "property").
I mean, sort of. The issue is that the compression is novel. So anything post tokenization could arguably be considered value add and not necessarily derivative work.
I guess they must delete all models since they acquired the source illegally and benefitted from it, right? Otherwise it just encourages others to keep going and pay the fines later.
In a prior ruling, the court stated that Anthropic didn't train on the books subject to this settlement. The record is that Anthropic scanned physical books and used those for training. The pirated books were being held in a general purpose library and were not, according to the record, used in training.
According to the judge, they didn't. The judge said they stored those books in a general purpose library for future use just in case they decided to use them later. It appears the judge took much issue with the downloading of "pirated content." And Anthropic decided to settle rather than let it all play out more.
That is something which is extremely difficult to prove from either side.
It is 500,000 books in total, so did they really scan all those books instead of using the pirated versions? Even when they did not have much money in the early phases of the model race?
The 500,000 number is the number of books that are part of the settlement. If they downloaded all of LibGen and the other sources, it was more like >7 million. But it is a lot of work to determine which books can legitimately be part of the lawsuit. For example, if any of the books in the download weren't copyrighted (think self-published), or not protected under US copyright law (maybe a book only published in Venezuela), or it isn't clear who owns the copyright, then that copyright owner cannot be part of the class. So it seems like the 500,000 number is basically the smaller number of books for which the lawyers for the plaintiff felt they could most easily prove standing.
> Buying used copies of books, scanning them, and training on it is fine.
Buying used copies of books, scanning them, and printing them and selling them: not fair use
Buying used copies of books, scanning them, and making merchandise and selling it: not fair use
The idea that training models is considered fair use just because you bought the work is naive. Fair use is not a law to leave open usage as long as it doesn’t fit a given description. It’s a law that specifically allows certain usages like criticism, comment, news reporting, teaching, scholarship, or research.
Training AI models for purposes other than purely academic fits into none of these.
Buying used copies of books, scanning them, training an employee with the scans: fair use.
Unless legislation changes, model training is pretty much analogous to that. Now of course if the employee in question - or the LLM - regurgitates a copyrighted piece verbatim, that is a violation and would be treated accordingly in either case.
Simultaneously I guess that would violate copyright, which is an interesting point. Maybe there's a case to be made there with model training.
Regardless, the issue could be resolved by buying as many copies as you have concurrent model training instances. It isn't really an issue with training on copyrighted work, just a matter of how you do so.
The purpose and character of AI models is transformative, and the effect of the model on the copyrighted works used in the model is largely negligible. That's what makes the use of copyrighted works in creating them fair use.
1. A Settlement Fund of at least $1.5 Billion: Anthropic has agreed to pay a minimum of $1.5 billion into a non-reversionary fund for the class members. With an estimated 500,000 copyrighted works in the class, this would amount to an approximate gross payment of $3,000 per work. If the final list of works exceeds 500,000, Anthropic will add $3,000 for each additional work.
2. Destruction of Datasets: Anthropic has committed to destroying the datasets it acquired from LibGen and PiLiMi, subject to any legal preservation requirements.
3. Limited Release of Claims: The settlement releases Anthropic only from past claims of infringement related to the works on the official "Works List" up to August 25, 2025. It does not cover any potential future infringements or any claims, past or future, related to infringing outputs generated by Anthropic's AI models.
Don't forget: NO LEGAL PRECEDENT! Which means anybody suing has to start all over. You only settle at this point if you think you'll lose.
Edit: I'll get ratio'd for this, but it's the exact same thing Google did in its lawsuit with Epic. They delayed while the public and courts focused on Apple (oohh, EVIL Apple); Apple lost, and Google settled at a disadvantage before there was a legal judgment that couldn't be challenged later.
A valid touché! I still think Google went with delaying tactics as public and other pressures forced Apple's case forward at greater velocity. (Edit: implicit "and then caved when Apple lost"... because they're the same case.)
Thank you. I assumed it would be quicker to find the link to the case PDF here, but your summary is appreciated!
Indeed, it is not only the payout but also the destruction of the datasets. Although the article does quote:
> “Anthropic says it did not even use these pirated works,” he said. “If some other generative A.I. company took data from pirated source and used it to train on and commercialized it, the potential liability is enormous. It will shake the industry — no doubt in my mind.”
Even if true, I wonder how many cases we will see in the near future.
Bootstrapping in the startup world refers to starting a startup using only personal resources instead of using investors. Anthropic definitely had investors.
Few. This settlement potentially weakens all challenges to the use of copyrighted works in training LLMs. I'd be shocked if behind closed doors there wasn't some give and take on the matter between executives/investors.
A settlement means the claimants no longer have a claim, which means if they're also part of - say - the New York Times affiliated lawsuit, they have to withdraw. A neat way of kneecapping a country-wide decision that LLM training on copyrighted material is subject to punitive measures, don't you think?
That's not even remotely true. Page 4 of the settlement describes released claims which only relate to the pirating of books. Again, the amount of misinformation and misunderstanding I see in copyright related threads here ASTOUNDS.
Did you miss the "also"? How about "adjacent"? I won't pretend to understand the legal minutiae, but reading the settlement doesn't mean you do either.
In my experience & training in a fintech corp: accepting a settlement in any suit weakens your defense, but prevents a judgment and future claims for the same claims from the same claimants (a la double jeopardy). So, again, at minimum, this prevents an actual judgment, which would likely have been positive for the NYT (and adjacent) cases.
> Although the payment is enormous, it is small compared with the amount of money that Anthropic has raised in recent years. This month, the start-up announced that it had agreed to a deal that brings an additional $13 billion into Anthropic’s coffers. The start-up has raised a total of more than $27 billion since its founding in 2021.
Maybe small compared to the money raised, but it is in fact enormous compared to the money earned. Their revenue was under $1b last year and they projected themselves as likely to make $2b this year. This payout equals their average yearly revenue of the last two years.
Here is an article that discusses why those numbers are misleading[1]. From a high level, "run rate" numbers typically take a monthly revenue number and multiply it by 12, and that just isn't an accurate way to report revenue, for reasons outlined in that article. When it comes to actual projections for annual revenue, they have said $2b is the most likely outcome for their 2025 annual revenue.
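To illustrate how far apart the two numbers can be for a fast-growing company, a quick sketch with purely hypothetical monthly figures:

    # Hypothetical monthly revenue in $M for a company growing quickly all year.
    monthly = [40, 55, 70, 90, 115, 145, 180, 220, 265, 315, 370, 430]

    run_rate = monthly[-1] * 12  # annualize the latest month: $5,160M "run rate"
    actual = sum(monthly)        # what the year actually brought in: $2,295M
    print(run_rate / actual)     # ~2.25 -- the run rate more than doubles reality

Which is why an "annualized run rate" headline and an actual annual revenue projection can be billions apart.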
But what are the profits? 1.5B is a huge amount, no matter what, especially if you’re committing to destroying the datasets as well. That implies you basically used 1.5B for a few years of additional training data, a huge price.
It doesn't matter if they end up in chapter 11... If it kneecaps all the other copyright lawsuits. I won't pretend to know the exact legal details. But I am (unfortunately) old enough that this isn't my first "giant corporation benefits from legally and ethically dubious copyright adjacent activities, gets sued, settles/wins." (Cough, google books)
Personally I believe that in the ideal scenario (for the fed govt.) these firms will develop the tech. The fed will then turn around and want those lawsuits to win - effectively gutting the firms financially and putting the tech in the hands of the public sector.
You never know, it's a game of interests and incentives. One thing's for sure: does the fed want the private sector to own and control a technology of this kind? Nope.
If they are going to be making billions in net income every year going forward, for as many years as analysts can make projections, and using these works allowed them to GTM faster/quicker/gain advantage against competitors, then it is quite great from a business perspective.
If it allowed them to move faster than their competition, I imagine management would consider it money well spent. They are expected to spend absurd amounts of money to get ahead. They were never expected to spend money efficiently if it meant taking additional months/years to get results.
Yeah it does, cost of materials is way more than that if they were building something physical like a new widget or something. Same idea, they paid for their raw materials.
You're joking, but that's actually a good pitch. There was a significant legal issue hanging over their heads, with some risk of a potentially business-ending judgment down the line. This makes it go away, which makes the company a safer, more valuable investment. Both in absolute terms and compared to peers who didn't settle.
It just resolves their liability with regards to books they purported they did not even train the models on, which is all that was left in this case after summary judgment. Sure the potential liability was company ending, but it's all a stupid business decision when it is ultimately for books they did not even train on.
It basically does nothing for them besides that. Given the split decisions so far, I'm not sure what value the Alsup decision is going to bring to the industry, moving forward, when it's in the context of books that Anthropic physically purchased. The other AI cases are generally not fact patterns where the LLM was trained with copyrighted materials that the AI company legally purchased copies of.
Isn't that how the whole system operates? Everyone is a conduit to allow rich people to enrich themselves further. The amount and quality of opportunities any individual receives are proportional to how well it serves existing capital.
So long as there is an excuse to justify money flows, that's fine, big capital doesn't really care about the excuse; so long as the excuse is just persuasive enough to satisfy the regulators and the judges.
Money flows happen independently, then later, people try to come up with good narratives. This is exactly what happened in this case. They paid the authors a lot of money as a settlement and agreed on a narrative which works for both sets of people; that training was fine, it's the pirating which was a problem...
It's likely why they settled; they preferred to pay a lot of money and agree on some false narrative which works for both groups rather than setting a precedent that AI training on copyrighted material is illegal; that would be the biggest loss for them.
> Isn't that how the whole system operates? Everyone is a conduit to allow rich people to enrich themselves further. The amount and quality of opportunities any individual receives are proportional to how well it serves existing capital.
I can't help but feel like this is a huge win for Chinese AI. Western companies are going to be limited in the amount of data they can collect and train on, and Chinese (or any foreign AI) is going to have access to much more and much better data.
The West can end the endless pain and legal hurdles to innovation by limiting copyright. They can do it if there is the will to open up the gates of information to everyone. The duration of 70 years after the death of the author, or 90 years for companies, is excessively long. It should be ~25 years. For software it should be 10 years.
And if AI companies want recent stuff, they need to pay the owners.
However, the West wants to infinitely enrich the lucky old people and companies who benefited from the lax regulations at the start of the 20th century. Their people chose not to let the current generations acquire equivalent wealth, at least not without the old hags getting their cut too.
This is sad for open source AI: piracy for the purpose of model training should also be fair use, because otherwise only the big companies who can afford to pay off publishers, like Anthropic, will be able to do so. There is no way to buy billions of books just for model training; it simply can't happen.
Fair use isn't about how you access the material, it's about what you can do with it after you legally access it. If you don't legally access it, the question of fair use is moot.
This is a settlement. It does not set a precedent nor even admit to wrongdoing.
> otherwise only the big companies who can afford to pay off publishers like Anthropic will be able to do so
Only well funded companies can afford to hire a lot of expensive engineers and train AI models on hundreds of thousands of expensive GPUs, too.
Something tells me many of the grassroots LLM training people are less concerned about the legality of their source training set than the big companies anyway.
It implies that people want everyone to do this when it's clear no one should do it. I'm not exactly a fan of "stealing isn't feasible for small businesses, so we should make it so everyone can steal".
Piracy is not stealing. I don't know why everyone on HN suddenly turned into a copyright hawk, only big companies benefit from our current copyright regime, like Disney and their lobbying for increasing its length.
No. It means model training is transformative enough to be fair use. They should just be asked to pay them back plus reimbursement/punishment, say pay 10x the price of the pirated books
Yes to the first part. Put your site behind a login wall that requires users to sign a contract to that effect before serving them the content... get a lawyer to write that contract. Don't rely on copyright.
I'm not sure to what extent you can specify damages like these in a contract, ask the lawyer who is writing it.
Contracts generally require an exchange of consideration (something of value, like money).
If you put a “contract” on your website that users click through without paying you or exchanging value with you and then you try to collect damages from them according to your contract, it’s not going to get you anywhere.
Maybe some kind of captcha like system could be devised that could be considered a security measure under the DMCA and not allowed to be circumvented. Make the same content available under a licence fee through an API.
>I'd argue you don't actually want this! You're suggesting companies should be able to make web scraping illegal.
At this point, we do need some laws regulating excessive scraping. We can't have the internet grind to a halt over everyone trying to drain it of information.
I'm sure one can try, but copyright has all kinds of oddities and carve-outs that make this complicated. IANAL, but I'm fairly certain that, for example, if you tried putting in your content license "Free for all uses public and private, except academia, screw that ivory tower..." that's a sentiment you can express but universities are under no obligation legally to respect your wish to not have your work included in a course presentation on "wild things people put in licenses." Similarly, since the court has found that training an LLM on works is transformative, a license that says "You may use this for other things but not to train an LLM" couldn't be any more enforceable than a musician saying "You may listen to my work as a whole unit but God help you if I find out you sampled it into any of that awful 'rap music' I keep hearing about..."
The purpose of the copyright protections is to promote "sciences and useful arts," and the public utility of allowing academia to investigate all works(1) exceeds the benefits of letting authors declare their works unponderable to the academic community.
(1) And yet, textbooks are copyrighted and the copyright is honored; I'm not sure why the academic fair-use exception doesn't allow scholars to just copy around textbooks without paying their authors.
That metaphor doesn't really work. It's a settlement, not a punishment, and this is payment, not a fine. Legally it's more like "The store wasn't open, so I took the items from the lot and paid them later".
It's not the way we expect people to do business under normal circumstances, but in new markets with new products? I guess I don't see much actually wrong with this. Authors still get paid a price they were willing to accept, and Anthropic didn't need to wait years to come to an agreement (again, publishers weren't actually selling what AI companies needed to buy!) before training their LLMs.
I’m sure this’ll be misreported and wilfully misinterpreted because of the current fractious state of the AI discourse, but given the lawsuit was to do with piracy, not the copyright-compliance of LLMs, and in any case, given they settled out of court, thus presumably admit no wrongdoing, conveniently no legal precedent is established either way.
I would not be surprised if investors made their last round of funding contingent on settling this matter out of court precisely to ensure no precedents are set.
Anthropic certainly seems to be hoping that their competitors will have to face some consequences too:
>During a deposition, a founder of Anthropic, Ben Mann, testified that he also downloaded the Library Genesis data set when he was working for OpenAI in 2019 and assumed this was “fair use” of the material.
Per the NYT article, Anthropic started buying physical books in bulk and scanning them for their training data, and they assert that no pirated materials were ever used in public models. I wonder if OpenAI can say the same.
Maybe, though this lawsuit is different in respect to the piracy issue. Anthropic is paying the settlement because they pirated the books, not because training on copyrighted books isn’t fair use which isn’t necessarily true with the other cases.
How do legal penalties and settlements work internationally? Are entities in other countries somehow barred from filing similar suits with more penalties?
Maybe I would think differently if I was a book author but I can't help but think that this is ugly but actually quite good for humanity in some perverse sense. I will never, ever, read 99.9% of these books presumably but I will use claude.
(Sorry, meta question: how do we insert in submissions the "Also: <link> <link>..." line that appears below the title and above the comment input? The text field on the "submit" page creates a user's post when the "url" field is also filled. I am missing something.)
Copies of these books are for sale for much less than that - very very few books demand a price that high.
They're paying much more than the actual damages because US copyright law comes with statutory damages for infringement of registered works on top of actual damages, between $200 and $150,000 per work. And the two sides negotiated this as a fair settlement to reduce the risk of an unfavourable outcome.
I wonder who will be the first country to make an exception to copyright law for model training libraries to attract tax revenue like Ireland did for tech companies in the EU. Japan is part of the way there, but you couldn't do a common crawl type thing. You could even make it a library of congress type of setup.
As long as you're not distributing, it's legal in Switzerland to download copyrighted material. (Switzerland was on the naughty US/MPAA list for a while, might still be)
Is it distribution, though, if someone in Switzerland downloads copyrighted material, trains a model on it, and then distributes the model...
Or what if they don't even distribute the model, but rather distribute the outputs of the LLM (so a closed-source LLM, like Anthropic's)?
I am genuinely curious as to whether there is some gray area that might be exploited by AI companies, as I am pretty sure that they don't want to pay 1.5B dollars yet still want to exploit the works of authors. (let's call a spade a spade)
Using copyrighted material to train AI is a legal grey zone. The NYT vs OpenAI case is litigating this. The Anthropic settlement here is about how the material was obtained. If OpenAI wins their case and Switzerland rules the same way, I don't think there would be a problem.
This might go down (I think) as one of the most influential court cases, then.
We really are getting at some metaphysical/philosophical questions, and maybe we will one day arrive at a question that just can't be answered (I think this is pretty close, right?), and then AI companies would do things freely without being accountable, since sure, you could take it to the courts, but how would you come to a decision...?
Another question, though:
So let's say the NYT vs OpenAI case is going on; in the meantime, while they are litigating, could OpenAI still continue doing the same thing?
They also agreed to destroy the pirated books. I wonder how large of a portion of their training data comes from these shadow libraries, and if AI labs in countries that have made it clear they won't enforce anti-piracy laws against AI companies will get a substantial advantage by continuing to use shadow libraries.
They already, prior to this lawsuit, prior to serving public models, replaced this data set with one they made by scanning purchased books. Destroying the data set they aren't even using should have approximately zero effect.
> "The technology at issue was among the most transformative many of us will see in our lifetimes"
A judge making on a ruling based on his opinion of how transformative a technology will be doesn't inspire confidence. There's an equivocation on the word "transformative" here -- not just transformative in the fair use sense, but transformative as in world-changing, impactful, revolutionary. The latter shouldn't matter in a case like this.
> Companies and individuals who willfully infringe on copyright can face significantly higher damages — up to $150,000 per work
Settling for 2% is a steal.
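Rough arithmetic behind that 2%, using the settlement's own figures and the statutory maximum for willful infringement:

    settlement = 1_500_000_000  # minimum non-reversionary fund
    works = 500_000             # estimated works in the class

    per_work = settlement / works    # $3,000 per work
    statutory_max = 150_000          # willful-infringement cap per work
    print(per_work / statutory_max)  # 0.02 -- i.e. the "2%"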
> In June, the District Court issued a landmark ruling on A.I. development and copyright law, finding that Anthropic’s approach to training A.I. models constitutes fair use,” Aparna Sridhar, Anthropic’s deputy general counsel, said in a statement.
This is the highest-order bit, not the $1.5B in settlement. Anthropic's guilty of pirating.
The printing press, audio recording, movies, radio, and television were also transformative. They did not get rid of copyright; some of them actually brought it into being.
I feel it is insane that authors do not receive some sort of standard compensation for each training use. Say a few hundred to a few thousand dollars, depending on the complexity of their work.
Same reason why the enterprise edition is more expensive than personal. Companies have more money to give and usually use it to generate profit. Individuals do not.
Because the ones doing the training are profiting from it. AI is not a human with limited time. And it is also owned by a company, not a legal person.
I might find the argument comparing it to a human convincing when it is a full legal person and cutting power to it or deleting it is treated as murder. Before that, it is just bullshit.
And fundamentally, the reason for copyright to exist is to support creators and encourage them to create more. In a world where massively funded companies can freely exploit their work, and in many cases fully substitute for it, that principle has failed.
If I buy a book, learn something, and then profit from it, should I also be paying more than the original price to read the book?
> Ai is not a human with limited time
AI is also bound by time, physics, and limited capacity. It does certain things better or faster than us; it fails miserably at certain things we don't even think of as complex (like opening a door).
> And it is also owned by a company not a legal person.
For the purpose of legalities, companies and persons are relatively equivalent, regardless of the merits, it is how it is
> In world where massively funded companies can freely exploit their work and even in many case fully substitute that principle is failed.
They paid for the books after getting caught, the other companies are paying for the copyrighted training materials
>They paid for the books after getting caught, the other companies are paying for the copyrighted training materials
Are they paying reasonable compensation? Say, like with streaming services, movie theatres, radio, and TV stations. As a whole, their model is much closer to those than to individuals buying books, CDs, or DVDs...
You might even consider a Theatrical License or Public Performance License, paid even if you have memorized the thing...
LLMs are just bad technology that requires a massive amount of input, so the authors cannot be compensated enough for it. And I fully believe they should be. And a lot more than a single copy of their work under the entirely ill-fitting first-sale doctrine provides.
For legal observers, Judge William Haskell Alsup’s razor-sharp distinction between usage and acquisition is a landmark precedent: it secures fair use for transformative generative AI while preserving compensation for copyright holders. In a just world, this balance would elevate him to the highest court of the land, but we are far from a just world.
This weirdly seems like it's the best mechanism to buy this much data.
Imagine going to 500k publishers to buy it individually. 3k per book is way cheaper. The copyright system is turning into a data marketplace in front of our eyes
I suspect you could acquire and scan every readily purchasable book for much less than $3k each. Scanhouse for instance charges $0.15 per page for regular unbound (disassembled) books, plus $0.25 for supervised OCR, plus another dollar if the formatting is especially complex; this comes out to maybe $200-300 for a typical book. Acquiring, shipping, and disposing of them all would of course cost more, but not thousands more.
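A rough sketch of that arithmetic (the page count is an assumption; the rates are the ones quoted above):

    pages = 500                       # assumed length of a longer trade book
    scan_ocr = (0.15 + 0.25) * pages  # $0.15/page scanning + $0.25/page supervised OCR
    surcharge = 1.00 * pages          # extra $1/page if formatting is especially complex
    print(scan_ocr)                   # $200 for a plain book
    print(scan_ocr + surcharge)       # $700 if every page needed the surcharge

Even with acquisition, shipping, and disposal on top, that stays well under the $3,000 per work being paid out here.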
The main cost of doing this would be the time - even if you bought up all the available scanning capacity it would probably take months. In the meantime your competition who just torrented everything would have more high-quality training data than you. There are probably also a fair number of books in libgen which are out of print and difficult to find used.
The court has to give preliminary approval to the settlement first. After that there should be a notice period during which the lawyers will attempt to reach out and tell you what you need to do to receive your money. (Not a lawyer, not legal advice).
For what it's worth the friction exists for a reason, conflicts of interest.
The lawyers suing Anthropic here will probably walk away with several hundred million dollars - they have won the lottery.
If they managed to extract twice as much money from Anthropic for the class, they'd walk away with probably twice as much... but winning the lottery twice isn't actually much better than winning the lottery once. Meanwhile $4500 is a lot more than $2250 (the latter is a reasonable estimate of how much you'll get per work after the lawyers' cut). Which risks the lawyers settling for less than is in their clients' best interests so that they can reliably get rich.
Personally (not a lawyer or anything) I think this settlement seems very fair, and I expect the court will approve it. But there's definitely been plenty of class actions in the past where lawyers really did screw over the class and (try to) settle for less than they should have to avoid risking going to trial.
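The incentive math being described, using this settlement's numbers (including the 25% fee cap mentioned upthread):

    per_work = 3_000  # gross payment per work in this settlement
    fee_cap = 0.25    # lawyer fees capped at 25%, subject to court approval

    print(per_work * (1 - fee_cap))      # $2,250 per work to the class member
    print(2 * per_work * (1 - fee_cap))  # $4,500 per work if twice as much were extracted

Doubling the recovery doubles what each class member gets, but for lawyers already clearing hundreds of millions, the marginal gain may not be worth the risk of losing at trial.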
... in one economy and for specific authors and publishers. But the offence is global in impact on authors worldwide, and the consequences for other IPR laws remains to be seen.
$1.5B is nothing but a handslap for the big gold rush companies.
It's less than 1% of Anthropic's valuation -- a valuation utterly dependent on all the hoovering up of others' copyrighted works.
AFAICT, if this settlement signals that the typical AI foundation model company's massive-scale commercial theft doesn't result in judgments that wipe out a company (and its execs), then we have confirmation that it is a free-for-all for all the other AI gold rush companies.
Then making deals to license rights, in sell-it-to-us-or-we'll-just-take-it-anyway deals, becomes only a routine and optional corporate cost reduction exercise, but not anything the execs will lose sleep over if it's inconvenient.
There are alternatives to wiping out the company that could be fair. For example, a judgment resulting in shares of the company, or revenue shares in the future, rather than a one-time payoff.
Writers were the true “foundational” piece of LLMs, anyway.
If this is an economist's idea of fair, where is the market?
If someone breaks into my house and steals my valuables, without my consent, then giving me stock in their burglary business isn't much of a deterrent to them and other burglars.
Deterrence/prevention is my real goal, not the possibility of a token settlement from whatever bastard rips me off.
We need the analogue of laws and police, or the analogue of homeowner has a shotgun.
I don't much like the idea of settling in stock, but I also think you're looking for criminal law here. Civil law, and this is a civil suit, is far more concerned with making damaged parties whole than acting as a deterrent.
I understand that intentional copyright infringement is a crime in the US, you just need to convince the DOJ to prosecute Anthropic for it...
> A trial was scheduled to begin in December to determine how much Anthropic owed for the alleged piracy, with potential damages ranging into the hundreds of billions of dollars.
It has been admitted, and Anthropic knew that this trial could totally bankrupt them had they maintained they were innocent and continued to fight the case.
But of course, there's too much money on the line, which means even though Anthropic settled (admitting guilt and profiting off of pirated books), they (Anthropic) knew there was no way they could win that case, and it was not worth taking that risk.
> The pivotal fair-use question is still being debated in other AI copyright cases. Another San Francisco judge hearing a similar ongoing lawsuit against Meta ruled shortly after Alsup's decision that using copyrighted work without permission to train AI would be unlawful in "many circumstances."
If it was a sure thing, then the rights holders wouldn't have accepted a settlement deal for a measly couple billion. Both sides are happier to avoid risking losing the suit.
Also, knowing how pro-corporate the legal system is, piercing the veil and going after everyone holding the stock would have been unlikely. So getting $1.5 billion out of them likely was a reasonable move. Otherwise they could have just burned all the money and flipped whatever was left over to someone else, at an uncertain price and horizon.
I'm excited for the moment when these models are able to treat using copyrighted work in a fair-use way that pays out to authors the way Spotify does when you listen to a song. Why? Because authors receiving royalties for their works when they get used in some prompt would likely encourage them to become far more accepting of LLMs.
Also, passing on the cost to consumers of generated content, since companies would now need to pay royalties on the back end, should likely increase the cost of generating slop and hopefully push back against that trend.
This shouldn't just be books, but all written content, like scholarly journals and essays, news articles and blogs, etc.
I realize this is just wishful thinking, but there's got to be some nugget of aspirational desire to pay it forward.
If I walked into a store and stole $1000 of books I would go to jail. If a tech company steals countless thousands of dollars worth of books, someone should go to jail.
This settlement I guess could be a landmark moment. $1.5 billion is a staggering figure and I hope it sends a clear signal that AI companies can’t just treat creative work as free training data.
All the AI companies are still using books as training data. They're just finding the cheapest scanned copies they can get their hands on to cover their asses.
I'm gonna say one thing. If you agree that something was unfairly taken from book authors, then the same thing was taken from people publishing on the web, and on a larger scale.
Book authors may see some settlement checks down the line. So might newspapers and other parties that can organize and throw enough $$$ at the problem. But I'll eat my hat if your average blogger ever sees a single cent.
This is not a fine, it's a settlement to recompense authors.
More broadly, I think that's a goofy argument. The books were "freely available" too. Just because something is out there, doesn't necessarily mean you can use it however you want, and that's the crux of the debate.
It's not the crux of this case. This is a settlement based on the judge's ruling that the books had been illegally downloaded. The same judge said that the training itself was not the problem – it was downloading the pirated books. It will be tough to argue that loading a public website is an illegal download.
Books aren't hosted publicly online free for anyone to access. The court seems to think buying a book and scanning it is fair use. Just using pirated books is forbidden. Blogs weren't accessed via pirating.
It seems weird that there was legal culpability for downloading pirated books but not for training on them. At the very least, there is a transitive dependency between the two acts.
Other people have said that Anthropic bought the books later on, but I haven't found any official records for that. Where would I find that?
Also, does anyone know which Anthropic models were NOT trained on the pirated books. I want to avoid such models.
https://archive.ph/wugNc
Rainbows End was prescient in many ways.
> Buying used copies of books, scanning them, and training on it is fine.
But nobody was ever going to do that, not when there are billions in VC dollars at stake for whoever moves fastest. Everybody will simply risk the fine, which tends to be nowhere close to enough to have a deterrent effect in the future.
That is like saying Uber would have not had any problems if they just entered into a licensing contract with taxi medallion holders. It was faster to just put unlicensed taxis on the streets and use investor money to pay fines and lobby for favorable legislation. In the same way, it was faster for Anthropic to load up their models with un-DRM'd PDFs and ePUBs from wherever instead of licensing them publisher by publisher.
> But nobody was ever going to do that
Didn't Google have a long standing project to do just that?
https://en.wikipedia.org/wiki/Google_Books
Anthropic literally did exactly this to train its models, according to the lawsuit. The lawsuit found that Anthropic didn't even use the pirated books to train its model. So there is that.
The lawsuit didn't find anything, Anthropic claimed this as part of the settlement. Companies settle without admission of wrongdoing all the time, to the extent that it can be bargained for.
The judge's ruling from earlier certainly seemed to me to suggest that the training was fair use.
Obviously, that's not part of the current settlement. I'm no expert on this, so I don't know the extent to which the earlier ruling applies.
If I'm reading this right yes the training was fair use, but I was responding to the claim that the pirated books weren't used to train commercially released LLMs. The judge complained that it wasn't clear what was actually used, from the June order https://fingfx.thomsonreuters.com/gfx/legaldocs/jnvwbgqlzpw/... [pdf]:
> Notably, in its motion, Anthropic argues that pirating initial copies of Authors’ books and millions of other books was justified because all those copies were at least reasonably necessary for training LLMs — and yet Anthropic has resisted putting into the record what copies or even sets of copies were in fact used for training LLMs.
> We know that Anthropic has more information about what it in fact copied for training LLMs (or not). Anthropic earlier produced a spreadsheet that showed the composition of various data mixes used for training various LLMs — yet it clawed back that spreadsheet in April. A discovery dispute regarding that spreadsheet remains pending.
Sir. These were carpoolers, just sharing a ride to their new online friends' B&B.
Sure, but that’s mostly because the sheer convenience of the illegal way is so much higher, and carries zero startup cost.
The same could be said of grand larceny. The difference would seem to be a mix of social norms and, more notably for this conversation, very different consequences.
I think the most notable difference is that grand larceny actually deprived someone of something they would have otherwise had, while pirating something you couldn't afford to buy doesn't because there was no circumstance in which they were getting the money and piracy doesn't involve taking anything from them...
> But nobody was ever going to do that
If the choice is between risking a $1.5 billion payout and just paying $15 million safely, they might.
Option 1: $183B valuation, $1.5B settlement.
Option 2: near-$0 valuation, $15M purchasing cost.
To an investor, that just looks like a pretty good deal, I reckon. It's just the cost of doing business - which in my opinion is exactly what is wrong with practices like these.
> which in my opinion is exactly what is wrong with practices like these.
What's actually wrong with this?
They paid $1.5B for a bunch of pirated books. Seems like a fair price to me, but what do I know.
The settlement should reflect society's belief about the cost, or the deterrent, I'm not sure which (maybe both).
This might be controversial, but I think a free society needs to let people break the rules if they are willing to pay the cost. Imagine if you couldn't speed in a car. Imagine if you couldn't choose to be jailed for nonviolent protest.
This isn't some case where they destroyed a billion dollars worth of pristine wilderness and got off with a slap on the wrist.
I'm still sore that I got punished repeatedly for not doing my homework but they still insisted that I do the homework. "I'll mark it zero! You'll fail!" "OK." "Fucking do the work you little shit"
a lesson for us all
I agree to some extent, but there is a slippery slope to “no rules apply to the rich”.
I do agree that in the case of victimless crimes, having some ability to compensate for damages instead of outright banning the thing means that we can enact many massively net-positive scenarios.
Of course, most crimes aren’t victimless, and that’s where the negative reactions are coming from (e.g. a company pollutes the commons to extract a profit).
> I think a free society needs to let people break the rules if they are willing to pay the cost
so you don't think super rich people should be bound by laws at all?
Unless you made the cost proportional (maybe exponential) to somebody's wealth, you would be creating a completely lawless class who would wreak havoc on society.
Hate to break it to you, but that's currently the world we live in. And yes, it sucks.
It's not about money. It's about time.
What you describe is in fact what Waymo has had, or chosen, to deal with. They didn't go for an end run around regulations related to vehicles on public roads. They committed to driverless vehicles and worked with local governments to roll them out as quickly as regulators were willing to allow.
Uber could have made the same decision and worked with regulators to be allowed into markets one at a time. It was an intentional choice to lean on the fact that Uber drivers blended into traffic and could hide in plain sight until Uber had enough market share and customer base to give them leverage.
Google did.
> Rainbows End was prescient in many ways.
Agreed. Great book for those looking for a read: https://www.goodreads.com/book/show/102439.Rainbows_End
The author, Vernor Vinge, is also responsible for popularizing the term 'singularity'.
RIP to the legend. He has a lot of really fun ideas spread across his books.
I didn't realize Vernor Vinge had passed away... Sad TIL
I got to meet him in person and tell him that his books (along with The Coming Technological Singularity) had a huge influence on my decision to go into ML. He seemed pleased. I just wish he had wrapped up the Fire Upon the Deep series.
There was a nice discussion & nostalgia at the time (1151 points, 2024, 320 comments) https://news.ycombinator.com/item?id=39775304
Cookie monster is his strongest work. It has a VIBE.
Reminds me of Permutation City.
One of my favorites
To be even more clear - this is a settlement, it does not establish precedent, nor admit wrongdoing. This does not establish that training is fair use, nor that scanning books is fine. That's somebody else's battle.
Right, the settlement doesn't.
However, the judge already ruled on the only important piece of this legal proceeding:
> Alsup ruled in June that Anthropic made fair use of the authors' work to train Claude...
I suspect that ruling legally gets wiped off the books by the settlement since the case gets dismissed, no?
Even if the ruling legally remains in place after the settlement, district court rulings are at most persuasive precedent and not binding precedent in future cases, even ones handled by the same court. In the US federal court system, only appellate rulings at either the circuit court of appeals level or the Supreme Court level are binding precedent within their respective jurisdictions.
The ruling also doesn’t establish precedent, because it is a trial court ruling, which is never binding precedent, and under normal circumstances can’t even be cited as persuasive precedent, and the settlement ensures there will be no appellate ruling.
Which is very important for e.g. the NYT lawsuit against OpenAI. Basically there’s now precedent that training AI models on text and having them produce output is not copyright infringement.
Judge Alsup’s ruling is not binding precedent, no.
> Buying used copies of books
It remains deranged.
Everyone has more than a right to freely read everything that is stored in a library.
(Edit: in fact, initially I wrote 'is supposed to' in place of 'has more than a right to' - meaning that "knowledge is there, we made it available: you are supposed to access it, with the fullest encouragement".)
> Everyone has more than a right to freely read everything that is stored in a library.
Every human has the right to read those books.
And now, this is obvious, but it seems to be frequently missed - an LLM is not a human, and does not have such rights.
By US law, according to Authors Guild v. Google[1] on the Google book scanning project, scanning books for indexes is fair use.
Additionally:
> Every human has the right to read those books.
Since when?
I strongly disagree - knowledge should be free.
I don't think the author's arrangement of the words should be free to reproduce (ie, I think some degree of copyright protection is ethical) but if I want to use a tool to help me understand the knowledge in a book then I should be able to.
[1] https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....
Huh?
I think he implies that because one can hypothetically borrow any book for free from a library, one could use them for legal training purposes, so the requirement of having your own copy should be moot.
Libraries aren’t just anarchist free-for-alls; they are operating under licensing terms. Google had a big squabble with the University of Illinois Urbana-Champaign research library before finally getting permission to scan the books there. Guess what: Google has the full text, but books.google.com only shows previews. Why is left as an exercise to the reader.
Libraries are neither anarchist free-for-alls, nor are they operating under licensing terms with regard to physical books.
They're merely doing what anyone is allowed to with the books that they own, loaning them out, because copyright law doesn't prohibit that, so no license is needed.
Yup. And if Anthropic CEO or whoever wants to drive down to the library and check out 30 books (or whatever the limit is), scan them, and then return them that is their prerogative I guess.
Scanning (copying) is¹ not allowed. Reading is.
What is in a library, you can freely read. Find the most appropriate way. You do not need to have bought the book.
¹(Edit: or /may/ not be allowed, see posts below.)
Scanning is, under the right circumstances, allowed in the US, at least per the Second Circuit appeals court (Connecticut, New York, Vermont): https://en.wikipedia.org/wiki/Authors_Guild%2C_Inc._v._Googl....
There are no terms and conditions attached to library books beyond copyright law (which says nothing about scanning) and the general premise of being a library (return the book in good condition on time or pay).
Copyright law in the USA may be more liberal about scanning than other jurisdictions (see the parallel comment from gpm), which expressly regulate the amount of copying of material you do not own as an item.
The jurisdictions I'm familiar with all give vague fair use/fair dealing exceptions which would cover some but not all copying (including scanning) with less than clear boundaries.
I'd be interested to know if you knew of one with bright line rules delineating what is and isn't allowed.
AFAIK to scan a book you need to destroy it by cutting the spine so it can feed cleanly into the scanner. That would incur a lot of fines.
That's what they did. They also destroyed books worth millions in the process.
They didn't think it would be a good idea to re-bind them and distribute them to a library or someone in need.
To be clear, they destructively scanned millions of books which in total were worth millions of dollars.
They did not destroy old, valuable books which individually were worth millions.
https://arstechnica.com/ai/2025/06/anthropic-destroyed-milli...
Nah, that's just if you want archival-quality scans. "Good enough for OCR" is a much lower bar.
Anthropic hired the book-scanning guy from Google for $1M+ to do just that (open the bindings).
I wonder what Aaron Swartz would think if he lived to see the era of libgen.
He died (2013) after libgen was created (2008).
I had no idea libgen was that old, thanks!
Yeah but did he die before anybody actually knew about it?
I knew about library genesis by 2012. It was at least 10 TiB large by then, IIRC. With the amount of Russian language content I got the impression it was more popular in that sphere, but an impressive collection for anyone and not especially secret.
Is libgen still around anymore? I can't find any functioning URLs.
Recent MEGATHREAD on status of libgen and alternatives
https://www.reddit.com/r/libgen/comments/1n4vjud/megathread_...
It's in the megathread linked in this comment, but I want to specifically point to https://open-slum.org/ which is basically a status page for different sites dedicated to this purpose, and which I've found helpful.
Lol. I opened that link and was like "hmmm, that UI looks familiar".
I'm pretty sure that's just a frontend for Uptime Kuma https://github.com/louislam/uptime-kuma
Anna's Archive includes all of libgen and a lot more: https://en.wikipedia.org/wiki/Anna%27s_Archive
There are mirrors on its Wikipedia page: https://en.wikipedia.org/wiki/Library_Genesis
libgen.help is frequently updated
I believe that there's a reddit sub that keeps people up to date with what URLs are, or are not, functioning at any given point in time
Didn't he get in trouble for contributing to sci-hub before he died?
He got into trouble for breaking into an unsecured network closet at MIT and using MIT credentials to download a bunch of copyrighted content.
The whole incident is written up in detail, https://swartz-report.mit.edu/ by Hal Abelson (who wrote SICP among other things). It is a well-researched document.
I think the parent may be getting at why he was downloading the content. I don't know the answer to this. Maybe someone here does. What was he intending to do with the articles?
The report speculates to his motivations on page 31, but it seems to be unknown with any certainty.
Swartz, like many of us, saw pay-for-access journals as an affront. I believe he wanted to "liberate" the content of these articles so that more people could read them.
Information may want to be free, but sometimes it takes a revolutionary to liberate it.
Yes, the ruling was a massive win for generative AI companies.
The settlement was a smart decision by Anthropic to remove a huge uncertainty. $1.5B is not small, but it won't stop them or slow them down significantly.
Google scanned many books quite a while ago, probably way more than LibGen. Are they good to use them for training?
If they legally purchased them, I don't see why not. IIRC they did borrow from libraries, so probably not every book in Google Books.
Anthropic legally purchased the books it used to train its model according to the judge. And the judge said that was fine. Anthropic also downloaded books from a pirate site and the judge said that was bad -- even though the judge also said they didn't use those books for training....
They litigated this a while ago and my understanding was that they were able to claim fair use, but I'm no expert.
What I'm wondering is if they, or others, have trained models on pirated content that has flowed through their networks?
Books.google.com was deemed fair use because it only shows previews, not full downloads. The Internet Archive is still under litigation IIRC; besides having owned a physical copy of every book they ever scanned (and keeping a copy in their warehouses), they let people read the whole thing.
I’m surprised Google hasn’t hit its competitors harder with the fact that they actually got permission to scan books from its partner libraries while Facebook and OpenAI just torrented books2/books3, but I guess they have an aligned incentive to benefit from a legal framework that doesn’t look too closely at how you went about collecting source material.
I imagine the problem there is they primarily scanned library books so I doubt they have the same copyright protections here as if they bought them
All those books were loaned by a library or purchased.
This is excellent news because it means that folks who pay for printed books and scan them can also train with their content. It's been said that we've already trained on "the entire (public) internet." Printed books still hold a wealth of knowledge that could be useful in training models. And cheap, otherwise unwanted copies make great fodder for "destructive" scanning, where you cut the spine off and feed the pages to a page scanner. There are online services that offer just that.
Yes, but the cat is out of the bag now. Welcome to the era of every piece of creative work coming with an EULA that you cannot train on it. It will be like clearing samples.
Paying $3,000 for pirating a ~$30 book seems disproportionate.
I feel like proportionality is related also to the scale. If a student pirates a textbook, I’d agree that 100x is excessive, but this is a corporation handsomely profiting off of mass piracy.
It’s crazy to imagine, but there was surely a document or Slack thread discussing where to get thousands of books, and they just decided to pirate them and that was OK. This was entirely a decision based on ease or cost, not on the assumption that it was legal. Piracy can result in jail time IIRC, so honestly it’s lucky the employee who suggested this, or took the action, avoided direct legal liability.
Oh, and I’m pretty sure other companies (Meta) are in litigation over this issue, and the publishers knew that settling below the full legal limit would limit future revenue.
> handsomely profiting
Well actively generating revenue at least.
Profits are still hard to come by.
Operating profits certainly but if you include investments the big players are raking it in aren't they?
Investment is debt lol. Maybe you can make the argument that you're increasing the equity value but you do have to eventually prove you're able to make money right? Maybe you don't, this system is pretty messed up after all.
What a fascinating software project someone had the opportunity to work on.
Not if 100 companies did it and they all got away.
This is to teach a lesson because you cannot prosecute all thieves.
Yale Law Journal actually writes about this, the goal is to deter crime because in most cases damages cannot be recovered or the criminal will never be caught in the first place.
If in most cases damages cannot be recovered or the criminal will never be caught in the first place, then what is the lesson being taught? Doesn't that just create a moral hazard where you "randomly" choose who to penalize?
It's about sending a message.
The message being you’ll likely get away with it?
They're setting up a pretty simple EV calc:
probability of getting caught (0.01) × cost if caught (1,000× the legal route) = expected cost of 10× the legal route = not worth it
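To make that concrete, here's a minimal Python sketch of the expected-value reasoning above (all figures are the hypothetical ones from the comment, not real numbers):

    # Hypothetical figures: compliance costs 1 unit, the penalty if caught
    # is 1000 units, and enforcement catches you 1% of the time.
    legal_cost = 1.0
    penalty_if_caught = 1000.0
    p_caught = 0.01

    expected_pirate_cost = p_caught * penalty_if_caught
    print(expected_pirate_cost / legal_cost)  # 10.0 -> piracy costs 10x in expectation

With those numbers the rational move is to pay for the data; the argument upthread is that real-world penalties and detection probabilities are nowhere near this deterrent.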
The EV calculation completely goes away if you add a layer of limited liability corporation.
With the per-item limit for "willful infringement" being $150,000, it's a bargain.
And a low end of $750/item.
Were you not around when people were getting sued for running Napster?
Fines should be disproportionate at this scale. So it discourages other businesses from doing the same thing.
So they’re creating monopolies? The existing players were allowed to do it, but anyone that tries to do it now will be hit with a 1.5B fine?
As long as they haven't been bullied into the corporate equivalent of suicide by the "justice" system, it's not disproportionate considering what happened to Aaron Swartz.
If anything it's too little based on precedent.
Well it's willful infringement so a court would be entitled to add a punitive multiplier anyway. But this is something Anthropic agreed to, if that wasn't clear.
Realistically it will be $30 per book and $2,970 for the lawyers
That's not how class actions work. Ever.
In this specific case the settlement caps the lawyer fees at 25%, and even that is subject to the court's approval. In addition they will ask for $250k total ($50k per plaintiff) for the lead plaintiffs, also subject to the court's approval.
25% of 1.5B?
It is related to scalable model training, however. Chopping the spine off books and putting the pages in an automated scanner is not scalable. And don't forget the cost of 1) finding, 2) purchasing, 3) processing, and 4) recycling that volume of books.
I guess companies will pay for the cheapest copies for liability and then use the pirated dumps. Or just pretend that someone lent the books to them.
> Chopping the spine off books and putting the pages in an automated scanner is not scalable.
That's how Google Books, the Internet Archive, and Amazon (their book preview feature) operated before ebooks were common. It's not scalable-in-a-garage but perfectly scalable for a commercial operation.
We hem and haw about metaphorical "book burning" so much we forget that books themselves are not actually precious.
The books that are destroyed in scanning are a small minority compared to the millions discarded by libraries every year for simply being too old or unpopular.
>we forget that books themselves are not actually precious.
Book burnings are symbolic (unless you're in the world of Fahrenheit 451). The real power comes from the political threat, not the fact that paper with words on it is now unreadable.
Well, the famous 1933-05-10 book burning did destroy the only copies of a lot of LGBT medical research, and destroying the last copy of various works was a stated intent of Nazi book burnings.
I don't think Google Books scanner chopped off the spine. https://linearbookscanner.org/ is the open design they released.
I remember them having a 3D page unwarping tech they built as well so they could photograph rare and antique books without hacking them apart.
Oh I didn't know that. That's wild
Wdym Rainbows End was prescient?
There's a scene early on where libraries are being destructively shredded, with the shreds scanned and reconstructed as digital versions.
> It’s important in the fair use assessment to understand that the training itself is fair use,
I think that this is a distinction many people miss.
If you take all the works of Shakespeare and reduce them to tokens and vectors, is it Shakespeare or is it factual information about Shakespeare? It is the latter, and as much as organizations like the MLB might want to be able to copyright a fact, you simply cannot do that (a toy sketch of this distinction follows below).
Take this one step further. If you buy the work and vectorize it, that's fine. But if you feed in the vectors for Harry Potter so many times that the model can reproduce half of the book, it becomes a problem when it spits out that copy.
And what about all the other stuff that LLMs spit out? Who owns that? Well, at present, no one. If you train a monkey or an elephant to paint, you can't copyright that work because they aren't human, and neither is an LLM.
If you use an LLM to generate your code at work, can you leave with that code when you quit? Does GPL3 or something like the Elastic Search license even apply if there is no copyright?
I suspect we're going to be talking about court cases a lot for the next few years.
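As a purely illustrative aside, here's a toy Python sketch of the tokens-vs-text distinction mentioned above (a word-level tokenizer and bigram counts; nothing like a production training pipeline):

    from collections import Counter

    text = "To be, or not to be, that is the question."
    words = text.lower().replace(",", "").replace(".", "").split()

    # Assign each distinct word an integer token id.
    vocab = {}
    token_ids = [vocab.setdefault(w, len(vocab)) for w in words]

    # Bigram counts are statistics *about* the text, not the text itself.
    bigrams = Counter(zip(token_ids, token_ids[1:]))

    print(token_ids)               # [0, 1, 2, 3, 0, 1, 4, 5, 6, 7]
    print(bigrams.most_common(2))  # [((0, 1), 2), ...]

The derived counts and vectors are facts about the work; the original expression only reappears if the model memorizes enough of them to regurgitate it, which is where the infringement question bites.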
The question is going to be how much human intellectual input there was I think. I don't think it will take much - you can write the crappiest novel on earth that is complete random drivel and you still have copyright on it.
So to me, if you are doing literally any human review, edits, control over the AI then I think you'll retain copyright. There may be a risk that if somebody can show that they could produce exactly the same thing from a generic prompt with no interaction then you may be in trouble, but let's face it should you have copyright at that point?
This is, however, why I favor stopping slightly short of full agentic development at this point. I want the human watching each step and an audit trail of the human interaction in doing it. Sure I might only get to 5x development speed instead of 10x or 20x but that is already such an enormous step up from where we were a year ago that I am quite OK with that for now.
Yes. Someone on this post mentioned that Switzerland allows downloading copyrighted material but not distributing it.
So things get even darker, because what counts as distribution can have a really vague definition, and maybe the AI companies will follow the law only just barely, for the sake of not getting hit with a lawsuit like this again. But I wonder if all this case did was compensate the authors this one time. I doubt we'll see a meaningful change in AI companies' attitudes towards fair use / essentially exploiting authors.
I feel like they would try to use as much legalspeak as possible to extract as much from authors (legally) without compensating them, which I feel is unethical, but sadly the law doesn't work on ethics.
Switzerland has five main collecting societies: ProLitteris for literature and visual arts, the SSA (Société Suisse des Auteurs) for dramatic works, the SUISA for music, Suissimage for audiovisual works, and SWISSPERFORM for related rights like those of performers and broadcasters. These non-profit societies manage copyright and related rights on behalf of their members, collecting and distributing royalties from users of their works.
Note that the law specifically regulates software differently, so what you cannot do is just willy nilly pirate games and software.
What distribution means in this case is defined in the swiss law. However swiss law as a whole is in some ways vague, to leave a lot up to interpretation by the judiciary.
> compensate the authors this one time.
I would assume it would compensate the publisher. Authors often hand ownership to the publisher; there would be obvious exceptions for authors who do well.
> And what about all the other stuff that LLM's spit out? Who owns that. Well at present, no one. If you train a monkey or an elephant to paint, you cant copyright that work because they aren't human, and neither is an LLM.
This seems too cute by half, courts are generally far more common sense than that in applying the law.
This is like saying using `rails generate model:example` results in a bunch of code that isn't yours, because the tool generated it according to your specifications.
The example is a real legal case afaik, or perhaps paraphrased from one (don’t think it was a monkey - an ape? An elephant?).
I’d guess the legal scenario for `rails generate` is that you have a license to the template code (by way of how the tool is licensed) and the template code was written by a human so licensable by them and then minimally modified by the tool.
I think you're thinking of this case [1], it was a monkey, it wasn't a painting but a selfie. A painting would have only made the no-copyright argument stronger.
[1] https://en.wikipedia.org/wiki/Monkey_selfie_copyright_disput...
I don’t think the code you get from rails generate is yours. Certainly not by way of copyright, which protects original works of authorship and so if it’s not original, it’s not copyrightable, and yes it’s been decided in US courts that non-human-authorship doesn’t count as creative.
> courts are generally far more common sense than that in applying the law.
'The Board’s decision was later upheld by the U.S. District Court for the District of Columbia, which rejected the applicant’s contention that the AI system itself should be acknowledged as the author, with any copyrights vesting in the AI’s owner. The court further held that the CO did not act arbitrarily or capriciously in denying the application, reiterating the requirement that copyright law requires human authorship and that copyright protection does not extend to works “generated by new forms of technology operating absent any guiding human hand, as plaintiff urges here.”' From: https://www.whitefordlaw.com/news-events/client-alert-can-wo...
The court is using common sense when it comes to the law. It is very explicit and always has been... That word "human" has some long standing sticky legal meaning (as opposed to things that were "property").
I mean, sort of. The issue is that the compression is novel. So anything post tokenization could arguably be considered value add and not necessarily derivative work.
Thanks for the reminder that what the Internet Archive did in its case would have been legal if it was in service of an LLM.
Many things become legal when the perpetrator has money.
The golden rule:
He who has the gold makes the rules
LLM’s are turning out to be a real get-out-of-legal-responsibilities card, hey?
I guess they must delete all models since they acquired the source illegally and benefitted from it, right? Otherwise it just encourages others to keep going and pay the fines later.
In a prior ruling, the court stated that Anthropic didn't train on the books subject to this settlement. The record is that Anthropic scanned physical books and used those for training. The pirated books were being held in a general purpose library and were not, according to the record, used in training.
So how did they profit off the pirated books?
According to the judge, they didn't. The judge said they stored those books in a general purpose library for future use just in case they decided to use them later. It appears the judge took much issue with the downloading of "pirated content." And Anthropic decided to settle rather than let it all play out more.
That is something which is extremely difficult to prove from either side.
It is 500,000 books in total so did they really scan all those books instead of using the pirated versions? Even when they did not have much money in the early phases of the model race?
The 500,000 number is the number of books that are part of the settlement. If they downloaded all of LibGen and the other sources, it was more like >7 million. But it is a lot of work to determine which books can legitimately be part of the lawsuit. For example, if any of the books in the download weren't copyrighted (think self-published), or weren't protected under US copyright law (maybe a book only published in Venezuela), or it isn't clear who owns the copyright, then that copyright owner cannot be part of the class. So it seems like the 500,000 number is basically the smaller set of books for which the lawyers for the plaintiffs felt they could most easily prove standing.
> Buying used copies of books, scanning them, and training on it is fine.
Buying used copies of books, scanning them, and printing them and selling them: not fair use
Buying used copies of books, scanning them, and making merchandise and selling it: not fair use
The idea that training models is considered fair use just because you bought the work is naive. Fair use is not a law to leave open usage as long as it doesn’t fit a given description. It’s a law that specifically allows certain usages like criticism, comment, news reporting, teaching, scholarship, or research. Training AI models for purposes other than purely academic fits into none of these.
Buying used copies of books, scanning them, training an employee with the scans: fair use.
Unless legislation changes, model training is pretty much analogous to that. Now of course if the employee in question - or the LLM - regurgitates a copyrighted piece verbatim, that is a violation and would be treated accordingly in either case.
> Buying used copies of books, scanning them, training an employee with the scans: fair use.
Does this still hold true if multiple employees are "trained" from scanned copies at the same time?
Simultaneously I guess that would violate copyright, which is an interesting point. Maybe there's a case to be made there with model training.
Regardless, the issue could be resolved by buying as many copies as you have concurrent model training instances. It isn't really an issue with training on copyrighted work, just a matter of how you do so.
It fits the most basic fair use: reading them. Current "training" can be considered a gross form of reading.
The purpose and character of AI models is transformative, and the effect of the model on the copyrighted works used in the model is largely negligible. That's what makes the use of copyrighted works in creating them fair use.
Settlement Terms (from the case pdf)
1. A Settlement Fund of at least $1.5 Billion: Anthropic has agreed to pay a minimum of $1.5 billion into a non-reversionary fund for the class members. With an estimated 500,000 copyrighted works in the class, this would amount to an approximate gross payment of $3,000 per work. If the final list of works exceeds 500,000, Anthropic will add $3,000 for each additional work (a quick arithmetic sketch follows after this list).
2. Destruction of Datasets: Anthropic has committed to destroying the datasets it acquired from LibGen and PiLiMi, subject to any legal preservation requirements.
3. Limited Release of Claims: The settlement releases Anthropic only from past claims of infringement related to the works on the official "Works List" up to August 25, 2025. It does not cover any potential future infringements or any claims, past or future, related to infringing outputs generated by Anthropic's AI models.
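For what it's worth, the arithmetic in term 1 works out as follows (a hypothetical helper in Python, using only the figures reported above):

    def settlement_fund(works_on_list: int) -> float:
        """Minimum fund size implied by the reported terms, in USD."""
        base_fund = 1.5e9      # non-reversionary minimum
        base_works = 500_000   # estimated works in the class
        per_work = 3_000       # approximate gross payment per work
        extra = max(0, works_on_list - base_works) * per_work
        return base_fund + extra

    print(settlement_fund(500_000))  # 1.5e9  -> ~$3,000 per work
    print(settlement_fund(550_000))  # 1.65e9 -> $3,000 for each additional work

Note the $3,000 is a gross figure; as discussed upthread, attorney fees (capped at 25%, subject to court approval) come out of the fund first.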
Don't forget: NO LEGAL PRECEDENT! Which means anybody suing has to start all over. You only settle at this point if you think you'll lose.
Edit: I'll get ratio'd for this, but it's the exact same thing Google did in its lawsuit with Epic. They delayed while the public and courts focused on Apple (oohh, EVIL Apple) - Apple lost, and Google settled at a disadvantage before there was a legal judgment that couldn't be challenged later.
I thought the courts decided against Google in Google vs Epic? It was even appealed and upheld. Are you thinking of another case? https://en.m.wikipedia.org/wiki/Epic_Games_v._Google
A full case means many more years of suits and appeals with high risks, so it's natural to settle, which obviously means no precedent.
Or, if you think your competition, also caught up in the same quagmire, stands to lose more by battling for longer than you did?
A valid touché! I still think Google went with delaying tactics as public and other pressures forced Apple's case forward at greater velocity. (Edit: implicit "and then caved when Apple lost"... because they're the same case.)
Thank you. I assumed it would be quicker to find the link to the case PDF here, but your summary is appreciated!
Indeed, it is not only the payout but also the destruction of the datasets. Although the article does quote:
> “Anthropic says it did not even use these pirated works,” he said. “If some other generative A.I. company took data from pirated source and used it to train on and commercialized it, the potential liability is enormous. It will shake the industry — no doubt in my mind.”
Even if true, I wonder how many cases we will see in the near future.
So... it would be a lot cheaper to just buy all of the books?
Yes, much.
And they actually went and did that afterwards. They just pirated them first.
Where can I find source that says Anthropic bought the pirated books afterwards? I haven't seen this in any official document.
Also, do we know if the newer models were trained without the pirated books?
> Where can I find source that says Anthropic bought the pirated books afterwards? I haven't seen this in any official document.
https://storage.courtlistener.com/recap/gov.uscourts.cand.43...
> Also, do we know if the newer models were trained without the pirated books?
I'm pretty sure we do but I couldn't swear to it or quickly locate a source.
What is the HN term for this? "Bootstrapping" your start up? Or is it "growth-hacking" it?
The latter (I know you're joking, but...)
Bootstrapping in the startup world refers to starting a startup using only personal resources instead of using investors. Anthropic definitely had investors.
Bookstrapping
The permission to buy them was already settled by Google Books in the 00's.
They did, but only after they pirated the books to begin with.
Few. This settlement potentially weakens all challenges to the use of copyrighted works in training LLMs. I'd be shocked if behind closed doors there wasn't some give and take on the matter between executives/investors.
A settlement means the claimants no longer have a claim, which means if they're also part of, say, the New York Times affiliated lawsuit, they have to withdraw. A neat way of kneecapping a country-wide decision that LLM training on copyrighted material is subject to punitive measures, don't you think?
That's not even remotely true. Page 4 of the settlement describes released claims which only relate to the pirating of books. Again, the amount of misinformation and misunderstanding I see in copyright related threads here ASTOUNDS.
Did you miss the "also"? How about "adjacent"? I won't pretend to understand the legal minutiae, but reading the settlement doesn't mean you do either.
In my experience & training in a fintech corp, accepting a settlement in any suit weakens your defense but prevents a judgement and future claims for the same claims from the same claimants (à la double jeopardy). So, again, at minimum this prevents an actual judgement, which likely would be positive for the NYT (and adjacent) cases.
Only 500,000 copyrighted works?
I was under the impression they had downloaded millions of books.
I’m an author, can I get in on this?
I had the same question.
It looks like you'll be able to search this site if the settlement is approved:
> https://www.anthropiccopyrightsettlement.com/
If your work is there, you qualify for a slice of the settlement. If not, you're outta luck.
Wait so they raised all that money just to give it to publishers?
Can only imagine the pitch: yes, please give us billions of dollars. We are going to make a huge investment, like paying off our lawsuits.
From the article:
> Although the payment is enormous, it is small compared with the amount of money that Anthropic has raised in recent years. This month, the start-up announced that it had agreed to a deal that brings an additional $13 billion into Anthropic’s coffers. The start-up has raised a total of more than $27 billion since its founding in 2021.
Maybe small compared to the money raised, but it is in fact enormous compared to the money earned. Their revenue was under $1b last year and they projected themselves as likely to make $2b this year. This payout equals their average yearly revenue of the last two years.
I thought they were projecting $10B and said a few months ago they had already grown from a $1B to a $4B run rate.
Here is an article that discusses why those numbers are misleading[1]. From a high level, "run rate" numbers are typically taking a monthly revenue number and multiplying it by 12 and that just isn't an accurate way to report revenue for reasons outlined in that article. When it comes to actual projections for annual revenue, they have said $2b is the most likely outcome for their 2025 annual revenue.
[1] - https://www.wheresyoured.at/howmuchmoney/
But what are the profits? $1.5B is a huge amount no matter what, especially if you’re committing to destroying the datasets as well. That implies you basically paid $1.5B for a few years of additional training data, a huge price.
It doesn't matter if they end up in chapter 11... If it kneecaps all the other copyright lawsuits. I won't pretend to know the exact legal details. But I am (unfortunately) old enough that this isn't my first "giant corporation benefits from legally and ethically dubious copyright adjacent activities, gets sued, settles/wins." (Cough, google books)
Personally I believe in the ideal scenario (for the fed govt.) these firms will develop the tech. The fed will then turn around and want those law suits to win - effectively gutting the firms financially and putting the tech in the hands of the public sector.
You never know, it's a game of interests and incentives - one thing's for sure - does the fed want the private sector to own and control a technology of this kind? Nope.
maybe I’m bad at math but paying >5% of your capital raised for a single fine doesn’t seem great from a business perspective
If they are going to be making billions in net income every year going forward, for as many years as analysts can make projections, and using these works allowed them to get to market faster and gain an advantage over competitors, then it is quite great from a business perspective.
If it allowed them to move faster than their competition, I imagine management would consider it money well spent. They are expected to spend absurd amounts of money to get ahead. They were never expected to spend money efficiently if that meant taking additional months or years to get results.
Someone here commented saying they claimed they did not even use it for training, so apparently it was useless.
It's VC money, I don't think anyone believes it's real money
If it weren't, why are we taking it as legal tender? I certainly wouldn't mind being paid in VC money
Yeah it does, cost of materials is way more than that if they were building something physical like a new widget or something. Same idea, they paid for their raw materials.
The money they don't pay out in settlements goes to Nvidia.
You're joking, but that's actually a good pitch. There was a significant legal issue hanging over their heads, with some risk of a potentially business-ending judgment down the line. This makes it go away, which makes the company a safer, more valuable investment. Both in absolute terms and compared to peers who didn't settle.
It just resolves their liability with regards to books they purported they did not even train the models on, which is all that was left in this case after summary judgment. Sure the potential liability was company ending, but it's all a stupid business decision when it is ultimately for books they did not even train on.
It basically does nothing for them besides that. Given the split decisions so far, I'm not sure what value the Alsup decision is going to bring to the industry, moving forward, when it's in the context of books that Anthropic physically purchased. The other AI cases are generally not fact patterns where the LLM was trained with copyrighted materials that the AI company legally purchased copies of.
Isn't that how the whole system operates? Everyone is a conduit to allow rich people to enrich themselves further. The amount and quality of opportunities any individual receives are proportional to how well it serves existing capital.
So long as there is an excuse to justify money flows, that's fine, big capital doesn't really care about the excuse; so long as the excuse is just persuasive enough to satisfy the regulators and the judges.
Money flows happen independently, then later, people try to come up with good narratives. This is exactly what happened in this case. They paid the authors a lot of money as a settlement and agreed on a narrative which works for both sets of people; that training was fine, it's the pirating which was a problem...
It's likely why they settled; they preferred to pay a lot of money and agree on some false narrative which works for both groups rather than setting a precedent that AI training on copyrighted material is illegal; that would be the biggest loss for them.
> Isn't that how the whole system operates? Everyone is a conduit to allow rich people to enrich themselves further. The amount and quality of opportunities any individual receives are proportional to how well it serves existing capital.
Yes, and FWIW that's very succinctly stated.
Sort of.
Some individuals in society find a way through that and figure out a way to strategically achieve their goals. Rare though.
They wanted to move fast and break things. No one made them.
If you are an author here are a couple of relevant links:
You can search LibGen by author to see if your work is included. I believe this would make you a member of the class: https://www.theatlantic.com/technology/archive/2025/03/searc...
If you are a member of the class (or think you are) you can submit your contact information to the plaintiff's attorneys here: https://www.anthropiccopyrightsettlement.com/
I can't help but feel like this is a huge win for Chinese AI. Western companies are going to be limited in the amount of data they can collect and train on, and Chinese (or any foreign AI) is going to have access to much more and much better data.
The West can end the endless pain and legal hurdles to innovation by limiting copyright. It can do so if there is will to open up the gates of information to everyone. The duration of 70 years after the death of the author, or 90 years for companies, is excessively long. It should be ~25 years. For software it should be 10 years.
And if AI companies want recent stuff, they need to pay the owners.
However, the West wants to infinitely enrich the lucky old people and companies who benefited from the lax regulations at the start of the 20th century. Its people chose not to let the current generations acquire equivalent wealth, at least not without the old hags getting their cut too.
This is sad for open source AI; piracy for the purpose of model training should also be fair use, because otherwise only the big companies that can afford to pay off publishers, like Anthropic, will be able to do so. There is no way to buy billions of books just for model training; it simply can't happen.
Fair use isn't about how you access the material; it's about what you can do with it after you legally access it. If you don't legally access it, the question of fair use is moot.
Hence, "should"
It’s the sign of a healthy economy when we respect the creation of content.
This is a settlement. It does not set a precedent nor even admit to wrongdoing.
> otherwise only the big companies who can afford to pay off publishers like Anthropic will be able to do so
Only well funded companies can afford to hire a lot of expensive engineers and train AI models on hundreds of thousands of expensive GPUs, too.
Something tells me many of the grassroots LLM training people are less concerned about the legality of their source training set than the big companies anyway.
This implies training models is some sort of right.
No, it implies that having the power to train AI models exclusively consolidated into a handful of extremely powerful companies is bad.
Isn’t that already the case, with the capacity required to train these models?
It implies that people want everyone to do this when it's clear no one should do it. I'm not exactly a fan of "this isn't profitable for small businesses to steal from so we should make it so everyone should steal".
Piracy is not stealing. I don't know why everyone on HN suddenly turned into a copyright hawk, only big companies benefit from our current copyright regime, like Disney and their lobbying for increasing its length.
> only big companies benefit from our current copyright regime
You’ve never authored, created, or published something? Never worked for a company that sells something protected by copyright?
> Never worked for a company that sells something protected by copyright?
I.e., never created software in exchange for money.
All my works are open source or in the public domain. I don't like copyright for a reason.
That's true. Those handful of companies shouldn't get to do it either.
No. It means model training is transformative enough to be fair use. They should just be asked to pay the authors back plus reimbursement/punishment - say, pay 10x the price of the pirated books.
I wonder how much it would cost to buy every book that you'd want to train a model.
500,000 x $20 = $10 million
Obviously there would be handling costs + scanning costs, so that’s the floor.
Maybe $20 million total? Plus, of course, the time it would take to execute.
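As a back-of-the-envelope sketch of that estimate in Python (every figure is a guess from this thread, not a real quote):

    books = 500_000
    price_per_book = 20       # USD, cheap used copy (guess)
    overhead_per_book = 20    # USD for handling, scanning, disposal (guess)

    floor = books * price_per_book                            # $10,000,000
    estimate = books * (price_per_book + overhead_per_book)   # $20,000,000
    print(f"${floor:,} floor, roughly ${estimate:,} all-in")

Either way, it's an order of magnitude less than the $1.5B settlement, which is the point being made upthread.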
I wish the hn rules were more flexible because I would write the best comment to you right now.
One thing that comes to mind is...
Is there a way to make your content on the web "licensed" in a way where it is only free for human consumption?
I.e. effectively making the use of AI crawlers pirating, thus subject to the same kind of penalties here?
Yes to the first part. Put your site behind a login wall that requires users to sign a contract to that effect before serving them the content... get a lawyer to write that contract. Don't rely on copyright.
I'm not sure to what extent you can specify damages like these in a contract, ask the lawyer who is writing it.
Contracts generally require an exchange of consideration (something of value, like money).
If you put a “contract” on your website that users click through without paying you or exchanging value with you and then you try to collect damages from them according to your contract, it’s not going to get you anywhere.
The consideration the viewer received was access to your private documents.
The consideration you received was a promise to refrain from using those documents to train AI.
I'm not a lawyer, but by my understanding of contract law consideration is trivially fulfilled here.
Maybe some kind of captcha-like system could be devised that would be considered a security measure under the DMCA and not allowed to be circumvented. Make the same content available under a licence fee through an API.
I'd argue you don't actually want this! You're suggesting companies should be able to make web scraping illegal.
That curl script you use to automate some task could become infringing.
>I'd argue you don't actually want this! You're suggesting companies should be able to make web scraping illegal.
At this point, we do need some laws regulating excessive scraping. We can't have the internet grind to a halt over everyone trying to drain it of information.
No. Neither legally nor technically possible.
I'm sure one can try, but copyright has all kinds of oddities and carve-outs that make this complicated. IANAL, but I'm fairly certain that, for example, if you tried putting in your content license "Free for all uses public and private, except academia, screw that ivory tower..." that's a sentiment you can express but universities are under no obligation legally to respect your wish to not have your work included in a course presentation on "wild things people put in licenses." Similarly, since the court has found that training an LLM on works is transformative, a license that says "You may use this for other things but not to train an LLM" couldn't be any more enforceable than a musician saying "You may listen to my work as a whole unit but God help you if I find out you sampled it into any of that awful 'rap music' I keep hearing about..."
The purpose of the copyright protections is to promote "sciences and useful arts," and the public utility of allowing academia to investigate all works(1) exceeds the benefits of letting authors declare their works unponderable to the academic community.
(1) And yet, textbooks are copyrighted and the copyright is honored; I'm not sure why the academic fair-use exception doesn't allow scholars to just copy around textbooks without paying their authors.
See kids? It's okay to steal if you steal more money than the fine costs.
They're paying $3000 per book. It would've been a lot cheaper to buy the books (which is what they actually did end up doing too).
That metaphor doesn't really work. It's a settlement, not a punishment, and this is payment, not a fine. Legally it's more like "The store wasn't open, so I took the items from the lot and paid them later".
It's not the way we expect people to do business under normal circumstances, but in new markets with new products? I guess I don't see much actually wrong with this. Authors still get paid a price they were willing to accept, and Anthropic didn't need to wait years to come to an agreement (again, publishers weren't actually selling what AI companies needed to buy!) before training their LLMs.
It will be interesting to see how this impacts the lawsuits against OpenAI, Meta, and Microsoft. Will they quickly try to settle for billions as well?
It’s not precedent setting but surely it’ll have an impact.
I’m sure this’ll be misreported and wilfully misinterpreted because of the current fractious state of the AI discourse, but given the lawsuit was to do with piracy, not the copyright-compliance of LLMs, and in any case, given they settled out of court, thus presumably admit no wrongdoing, conveniently no legal precedent is established either way.
I would not be surprised if investors made their last round of funding contingent on settling this matter out of court precisely to ensure no precedents are set.
Anthropic certainly seems to be hoping that their competitors will have to face some consequences too:
>During a deposition, a founder of Anthropic, Ben Mann, testified that he also downloaded the Library Genesis data set when he was working for OpenAI in 2019 and assumed this was “fair use” of the material.
Per the NYT article, Anthropic started buying physical books in bulk and scanning them for their training data, and they assert that no pirated materials were ever used in public models. I wonder if OpenAI can say the same.
Maybe, though this lawsuit is different in respect to the piracy issue. Anthropic is paying the settlement because they pirated the books, not because training on copyrighted books isn’t fair use which isn’t necessarily true with the other cases.
Didn't Meta do the exact same thing?
https://www.tomshardware.com/tech-industry/artificial-intell...
That was my first thought. While not legal precedent, it does sort of open the floodgates for others.
Wooo, I sure could use $3k right now and I've got something in the pirate libraries they scraped. Nice.
So if you buy the content legally and fine tune using it that's fair use?
I do not believe authors will see any of this money. I will change my mind when I see an email or check.
"$3,000 per work" seems like an incredibly good deal to license a book.
This was a very tactical decision by Anthropic. They have just received Series F funding, and they can now afford to settle this lawsuit.
OpenAI and Google will follow soon now that the precedent has been set, and will likely pay more.
It will be a net win for Anthropic.
As a published author who had works in the training data, can I take my settlement payout in the form of Claude Code API credits?
TBH I'm just going to plow all that money back into Anthropic... might as well cut out the middleman.
I wonder if Anthropic's lawyers have enough of a sense of humor to take you up on that if you sent them an email asking...
How do legal penalties and settlements work internationally? Are entities in other countries somehow barred from filing similar suits with more penalties?
Maybe I would think differently if I was a book author but I can't help but think that this is ugly but actually quite good for humanity in some perverse sense. I will never, ever, read 99.9% of these books presumably but I will use claude.
(Everyone say it with me)
Thats a weird way for Anthropic to announce they're going out of business.
(Sorry, meta question: how do we insert in submissions that "'Also' <link> <link>..." below the title and above the comment input? The text field in the "submit" page creates a user's post when the "url" field is also filled. I am missing something.)
Why are they paying $3,000 per book? Does anyone think these authors sell their books for that amount?
Copies of these books are for sale for much less than that - very very few books demand a price that high.
They're paying much more than the actual damages because US copyright law comes with statutory damages for infringement of registered works on top of actual damages, between $200 and $150,000 per work. And the two sides negotiated this as a fair settlement to reduce the risk of an unfavourable outcome.
It's better to ask for forgiveness than for permission.
Taken right from the VC's handbook.
I wonder who will be the first country to make an exception to copyright law for model training libraries to attract tax revenue like Ireland did for tech companies in the EU. Japan is part of the way there, but you couldn't do a common crawl type thing. You could even make it a library of congress type of setup.
As long as you're not distributing, it's legal in Switzerland to download copyrighted material. (Switzerland was on the naughty US/MPAA list for a while, might still be)
Is it distribution, though, if someone in Switzerland downloads copyrighted material, trains a model on it, and then distributes the model?
Or what if they don't even distribute the model, but rather distribute its outputs (a closed-source LLM, like Anthropic's)?
I am genuinely curious whether there is some grey area that might be exploited by AI companies, as I am pretty sure they don't want to pay $1.5B yet still want to exploit the works of authors (let's call a spade a spade).
Using copyrighted material to train AI is a legal grey zone. The NYT vs. OpenAI case is litigating this. The Anthropic settlement here is about how the material was obtained. If OpenAI wins their case and Switzerland rules the same way, I don't think there would be a problem.
This might go down (I think) as one of the most influential court cases, then.
We really are getting at some metaphysical/philosophical questions, and maybe we will one day arrive at a question that just can't be answered (I think this is pretty close, right?). At that point AI companies would do things freely without being held accountable, since sure, you could take it to the courts, but how would the courts ever reach a decision...?
Another question, though:
Let's say the NYT vs. OpenAI case is ongoing. While they are litigating, can OpenAI still continue doing the same thing?
They also agreed to destroy the pirated books. I wonder how large a portion of their training data comes from these shadow libraries, and whether AI labs in countries that have made it clear they won't enforce anti-piracy laws against AI companies will gain a substantial advantage by continuing to use shadow libraries.
They already replaced this data set, prior to this lawsuit and prior to serving public models, with one they made by scanning purchased books. Destroying a data set they aren't even using should have approximately zero effect.
Perhaps they'll quickly rent the whole contents of a few physical libraries and then scan them all
I thought $1.5B was the penalty for one torrent, not for a couple million torrents.
At least if you're a regular citizen.
Make sure to grab the mother of all torrents, I guess, if you're going to go down that path. That way you get more bang for your $1.5B penalty.
A million torrents would cost 1,500 each.
> "The technology at issue was among the most transformative many of us will see in our lifetimes"
A judge basing a ruling on his opinion of how transformative a technology will be doesn't inspire confidence. There's an equivocation on the word "transformative" here: not just transformative in the fair-use sense, but transformative as in world-changing, impactful, revolutionary. The latter shouldn't matter in a case like this.
> Companies and individuals who willfully infringe on copyright can face significantly higher damages — up to $150,000 per work
Settling for 2% is a steal.
> "In June, the District Court issued a landmark ruling on A.I. development and copyright law, finding that Anthropic's approach to training A.I. models constitutes fair use," Aparna Sridhar, Anthropic's deputy general counsel, said in a statement.
This is the highest-order bit, not the $1.5B in settlement: Anthropic is guilty of pirating, but the fair-use finding is what matters.
The printing press, audio recording, movies, radio, and television were also transformative. They did not get rid of copyright; if anything, they are what brought it into being.
I feel it is insane that authors do not receive some sort of standard compensation for each training use. Say a few hundred to a few thousand dollars, depending on the complexity of their work.
Why would they earn more from models reading their works than I would pay to read it?
Same reason the enterprise edition is more expensive than the personal one. Companies have more money to give and usually use the product to generate profit. Individuals do not.
Because the ones doing the training are profiting from it. AI is not a human with limited time. And it is owned by a company, not a natural person.
I might entertain the comparison to a human once an AI is a full legal person and cutting its power or deleting it is treated as murder. Before that, it is just bullshit.
And fundamentally, the reason copyright exists is to support creators and encourage them to create more. In a world where massively funded companies can freely exploit their work, and in many cases fully substitute for it, that principle has failed.
If I buy a book, learn something, and then profit from it, should I also be paying more than the original price to read the book?
> AI is not a human with limited time
AI is also bound by time, physics, and limited capacity. It does certain things better or faster than us, and it fails miserably at certain things we don't even think of as complex (like opening a door).
> And it is owned by a company, not a natural person.
For legal purposes, companies and persons are treated as roughly equivalent. Regardless of the merits, it is how it is.
> In a world where massively funded companies can freely exploit their work, and in many cases fully substitute for it, that principle has failed.
They paid for the books after getting caught, and the other companies are paying for their copyrighted training materials.
>They paid for the books after getting caught, and the other companies are paying for their copyrighted training materials
Are they paying reasonable compensation? Say, like streaming services, movie theatres, and radio and TV stations do? As a whole, their model is much closer to those than to individuals buying books, CDs, or DVDs...
You might even consider a theatrical license or public performance license, which is paid even if you have memorized the thing...
LLMs are just a bad technology in that they require such a massive amount of input that authors cannot be compensated enough for it. And I fully believe they should be, and at a lot more than the price of a single copy of their work under the entirely ill-fitting first-sale doctrine.
> If I buy a book, learn something, and then profit from it, should I also be paying more than the original price to read the book?
Depends on how you do it. Clearly, reading the book word for word is different from making a podcast about your interpretation of the book.
Does anyone know which models were trained on the pirated books? I would like to avoid using those models.
$3,000 per work isn't a bad price for Anthropic, but it seems insulting to the copyright holder.
So if a startup wants to buy book PDFs legally to use for AI purposes, any suggestions on how to do that?
Reach out to the publishers or resellers (like Amazon, for instance).
Place an order: "I want to buy all your books as EPUBs."
Pay and fetch the files.
That's all.
“Agrees” is a funny old word
Anyone have a link to the class action? I published a book and would love to know if I'm in the class.
Docket: https://www.courtlistener.com/docket/69058235/bartz-v-anthro...
Proposed settlement: https://storage.courtlistener.com/recap/gov.uscourts.cand.43...
https://authorsguild.org/news/what-authors-need-to-know-abou... looks useful.
Deep Research on Claude, perhaps, for some irony if you will.
For legal observers, Judge William Haskell Alsup’s razor-sharp distinction between usage and acquisition is a landmark precedent: it secures fair use for transformative generative AI while preserving compensation for copyright holders. In a just world, this balance would elevate him to the highest court of the land, but we are far from a just world.
This weirdly seems like it's the best mechanism for buying this much data.
Imagine going to 500k publishers to buy it individually; $3k per book is way cheaper. The copyright system is turning into a data marketplace in front of our eyes.
I suspect you could acquire and scan every readily purchasable book for much less than $3k each. Scanhouse, for instance, charges $0.15 per page for regular unbound (disassembled) books, plus $0.25 per page for supervised OCR, plus another dollar per page if the formatting is especially complex; this comes out to maybe $200-300 for a typical book (rough sketch below). Acquiring, shipping, and disposing of them all would of course cost more, but not thousands more.
The main cost of doing this would be the time - even if you bought up all the available scanning capacity it would probably take months. In the meantime your competition who just torrented everything would have more high-quality training data than you. There are probably also a fair number of books in libgen which are out of print and difficult to find used.
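A minimal sketch of that estimate; the per-page rates are the ones quoted above, while the page count and used-copy price are assumptions for illustration:

    # Rough per-book digitization cost at the quoted per-page rates.
    # Page count and purchase price are illustrative assumptions.
    pages = 600
    scan_rate = 0.15    # $/page, regular unbound (disassembled) scanning
    ocr_rate = 0.25     # $/page, supervised OCR

    scanning = pages * (scan_rate + ocr_rate)   # $240 for a 600-page book
    purchase = 25                               # assumed price of a used copy
    total = scanning + purchase                 # ~$265, far below $3,000 per work

    print(total)

Even doubling that for shipping, handling, and complex formatting leaves a wide margin under the settlement's per-work figure.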
It's a tiny amount of data, relatively speaking, and much more expensive per token than almost any other data source imaginable.
Wait, I’m a published author, where’s my check
The court has to give preliminary approval to the settlement first. After that there should be a notice period during which the lawyers will attempt to reach out and tell you what you need to do to receive your money. (Not a lawyer, not legal advice).
You can follow the case here: https://www.courtlistener.com/docket/69058235/bartz-v-anthro...
You can see the motion for settlement (what the news article is about) here: https://storage.courtlistener.com/recap/gov.uscourts.cand.43...
Thank you very much. There seems to be a lot of friction in this seemingly simple process…
For what it's worth, the friction exists for a reason: conflicts of interest.
The lawyers suing Anthropic here will probably walk away with several hundred million dollars - they have won the lottery.
If they managed to extract twice as much money from Anthropic for the class, they'd walk away with probably twice as much... but winning the lottery twice isn't actually much better than winning the lottery once. Meanwhile, $4,500 is a lot more than $2,250 (the latter is a reasonable estimate of how much you'll get per work after the lawyers' cut; rough arithmetic below). That risks the lawyers settling for less than is in their clients' best interests so that they can reliably get rich.
Personally (not a lawyer or anything) I think this settlement seems very fair, and I expect the court will approve it. But there's definitely been plenty of class actions in the past where lawyers really did screw over the class and (try to) settle for less than they should have to avoid risking going to trial.
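The arithmetic behind those per-work figures, as a sketch; the ~25% contingency fee is an assumption, since the actual fee awaits court approval:

    # How $2,250 falls out of a $3,000 gross award under an assumed ~25% fee.
    gross_per_work = 3_000
    fee_rate = 0.25                                  # assumed contingency rate

    net_per_work = gross_per_work * (1 - fee_rate)   # $2,250 to the author
    net_if_doubled = 2 * net_per_work                # $4,500 had the class won twice as much

    print(net_per_work, net_if_doubled)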
Interesting. Maybe there should be an easier way to file class-action lawsuits and collect on them, in a cheaper and more efficient manner.
Do they even have that much cash on hand?
They just raised $13B, so yes
https://www.anthropic.com/news/anthropic-raises-series-f-at-...
... in one economy and for specific authors and publishers. But the offence is global in its impact on authors worldwide, and the consequences for other IPR laws remain to be seen.
$1.5B is nothing but a handslap for the big gold-rush companies.
It's less than 1% of Anthropic's valuation, a valuation utterly dependent on all the hoovering up of others' copyrighted works.
AFAICT, if this settlement signals that the typical AI foundation model company's massive-scale commercial theft doesn't result in judgments that wipe out a company (and its execs), then we have confirmation that it's a free-for-all for all the other AI gold-rush companies.
Then making deals to license rights, in sell-it-to-us-or-we'll-just-take-it-anyway deals, becomes only a routine and optional corporate cost-reduction exercise, not anything the execs will lose sleep over if it's inconvenient.
> It's less than 1% of Anthropic's valuation
The settlement is real money though. Valuation is imaginary.
There are alternatives to wiping out the company that could be fair. For example, a judgment resulting in shares of the company, or a share of future revenue, rather than a one-time payoff.
Writers were the true “foundational” piece of LLMs, anyway.
If this is an economist's idea of fair, where is the market?
If someone breaks into my house and steals my valuables, without my consent, then giving me stock in their burglary business isn't much of a deterrent to them and other burglars.
Deterrence/prevention is my real goal, not the possibility of a token settlement from whatever bastard rips me off.
We need the analogue of laws and police, or the analogue of homeowner has a shotgun.
I don't much like the idea of settling in stock, but I also think you're looking for criminal law here. Civil law, and this is a civil suit, is far more concerned with making damaged parties whole than with acting as a deterrent.
I understand that intentional copyright infringement is a crime in the US; you just need to convince the DOJ to prosecute Anthropic for it...
Isn't this basically what Spotify did originally?
I can see a price hike incoming.
It was coming regardless of the case results.
[dupe] https://news.ycombinator.com/item?id=45142558
A terrible precedent that guarantees China a win in the AI race
Now how about Meta and their questionable means of acquiring tons of content?
Maybe it's time to get some Llama models copied before an overzealous court rules badly.
> A trial was scheduled to begin in December to determine how much Anthropic owed for the alleged piracy, with potential damages ranging into the hundreds of billions of dollars.
It has been admitted, and Anthropic knew, that this trial could have utterly bankrupted them had they maintained their innocence and continued to fight the case.
But of course, there's too much money on the line. Even though Anthropic settled (admitting guilt and having profited off pirated books), they knew there was no way they could win that case, and it was not worth taking that risk.
> The pivotal fair-use question is still being debated in other AI copyright cases. Another San Francisco judge hearing a similar ongoing lawsuit against Meta ruled shortly after Alsup's decision that using copyrighted work without permission to train AI would be unlawful in "many circumstances."
The first of many.
They would only be wiped out if the court awarded the maximum statutory damages (or close to it). There was never any chance of that happening.
If it was a sure thing, then the rights holders wouldn't have accepted a settlement deal for a measly couple billion. Both sides are happier to avoid risking losing the suit.
Also, knowing how pro-corporate the legal system is, piercing the veil and going after everyone holding the stock would have been unlikely. So getting $1.5 billion out of them was likely a reasonable move. Otherwise they could have just burned all the money and flipped whatever was left over to someone else, at an uncertain price and on an uncertain timeline.
Wait, DID they admit guilt? A lot of times companies settle without admitting guilt.
Honestly, this is a steal for Anthropic.
I'm excited for the moment when these models can use copyrighted work in a fair-use way that pays out to authors, the way Spotify does when you listen to a song. Why? Because authors receiving royalties for their works when those works get used in some prompt would likely make them far more accepting of LLMs.
Passing the cost on to consumers of generated content, since companies would now need to pay royalties on the back end, should also increase the cost of generating slop and hopefully push back against that trend.
This shouldn't just be books, but all written content, like scholarly journals and essays, news articles and blogs, etc.
I realize this is just wishful thinking, but there's got to be some nugget of aspirational desire to pay it forward.
Great. Which rich person is going to jail for breaking the law?
This isn't a criminal case, so zero people of any financial position will end up in prison.
No one, rich or poor, goes to jail for downloading books.
If I walked into a store and stole $1,000 worth of books, I would go to jail. If a tech company steals countless thousands of dollars' worth of books, someone should go to jail.
Stealing physical goods is not the same as downloading copyrighted material.
Are you sure? I think in some jurisdictions they would, according to the law.
Tell that to Aaron Swartz.
Swartz wasn't charged only for downloading copyrighted material; he was also charged with wire fraud and breaking and entering.
This settlement, I guess, could be a landmark moment. $1.5 billion is a staggering figure, and I hope it sends a clear signal that AI companies can't just treat creative work as free training data.
I mean, the ruling does in fact find that training on this particular kind of creative work qualifies as fair use.
All the AI companies are still using books as training data. They're just finding the cheapest scanned copies they can get their hands on to cover their asses.
I'm gonna say one thing. If you agree that something was unfairly taken from book authors, then the same thing was taken from people publishing on the web, and on a larger scale.
Book authors may see some settlement checks down the line. So might newspapers and other parties that can organize and throw enough $$$ at the problem. But I'll eat my hat if your average blogger ever sees a single cent.
The blogger’s content was freely available, this fine is for piracy.
This is not a fine, it's a settlement to recompense authors.
More broadly, I think that's a goofy argument. The books were "freely available" too. Just because something is out there, doesn't necessarily mean you can use it however you want, and that's the crux of the debate.
It's not the crux of this case. This is a settlement based on the judge's ruling that the books had been illegally downloaded. The same judge said that the training itself was not the problem; it was downloading the pirated books. It will be tough to argue that loading a public website is an illegal download.
But you can use copyrighted works for transformative works under the fair-use doctrine, and training was ruled to be fair use in the previous ruling.
Books aren't hosted publicly online, free for anyone to access. The court seems to think buying a book and scanning it is fair use; just using pirated books is forbidden. Blogs weren't accessed via piracy.
The settlement was for downloading the pirated books, not training from them. Unless they're paywalled it would be hard to argue the same for a blog.
It seems weird that there was legal culpability for downloading pirated books but not for training on them. At the very least, there is a transitive dependency between the two acts.
Other people have said that Anthropic bought the books later on, but I haven't found any official records for that. Where would I find that?
Also, does anyone know which Anthropic models were NOT trained on the pirated books? I want to avoid those models.
As far as anyone knows, no models were trained on the illegally downloaded books.