The Tragedy of Google Books (2017)

(theatlantic.com)

503 points | by lispybanana 9 months ago ago

192 comments

pvg 9 months ago

https://archive.is/rQ7Zb

[-]

neonate 8 months ago

https://web.archive.org/web/20170719004247/https://www.theat...

philipkglass 9 months ago

These Google scans are also available in the HathiTrust [1], an organization built from the big academic libraries that participated in early book digitization efforts. The HathiTrust is better about letting the public read books that have actually fallen into the public domain. I have found many books that are "snippet view" only on Google Books but freely visible on HathiTrust.

If you are a student or researcher at one of the participating HathiTrust institutions, you can also get access to scans of books that are still in copyright.

The one advantage Google Books still has is that its search tools are much faster and sometimes better, so it can be useful to search for phrases or topics on Google Books and then jump over to HathiTrust to read specific books surfaced by the search.

[1] https://www.hathitrust.org/

[-]

acidburnNSA 9 months ago

Hathitrust has been absolutely transformative for me, as an amateur nuclear enterprise historian.

[-]

germinalphrase 9 months ago

“…nuclear enterprise…”

As in, the business of running a nuclear energy plant?

[-]

acidburnNSA 9 months ago

Yes. Electricity, propulsion, desalination, space heating, etc.

[-]

dghlsakjg 9 months ago

The old Lia pocket warmer?

[-]

acidburnNSA 9 months ago

Yes!

dredmorbius 9 months ago

HathiTrust is a fine example of a repository which is in theory useful but in practice all but useless.

Participation is limited to tertiary academic institutions, and possibly only four-year (rather than two-year) ones. This excludes local (city/county) libraries, as well as primary/secondary (grammar / middle / high school in the US) libraries.

Even public-domain records cannot be downloaded in whole, but rather can be saved one page at a time as PDFs. I'm pretty sure that those interested in more useful archival will and/or have created automated tools to do so, but HathiTrust remains the most notable point-of-access for such works, and the additional generation of conversion and republication further degrades the quality of original-publication formats. (It's less a problem for regenerated works from OCR'd or manually-converted documents, but those of course lose all the characteristics of original publication.)

And of course, many materials still under copyright are not accessible to the general public at all, no matter how obscure. I'd run into a case of this some months back trying to get a date attribution of an Alan Watts lecture which had been posted to HN:

<https://news.ycombinator.com/item?id=41231047> (thread).

And my request still stands. Anyone with an academic affiliation who can check <https://catalog.hathitrust.org/Record/000678503> and see how it relates to this post (<https://news.ycombinator.com/item?id=41230841>) would have my gratitude.

[-]

joshuaissac 9 months ago

Even with institutional login, HathiTrust does not show the full text online for this journal. It only permits searching and then showing the page numbers of the matches, which can be done without logging in.

But I think this journal does not contain the date.

Searching for "his religion" (with quotation marks) in volume 6 via HathiTrust shows a single match on page 11. Searching for the same text via the Google Books link from your other post shows the following entry among a list of what I assume are lectures:

> 919 Jesus: His Religion, Or The Religion About Him ... 10.00 7.00

The first number is some kind of index or serial number. The second number is the cassette cost and the third is the reel cost. You can see the column headings by searching for the number 900.

Searching for "Watts" in the same book via Google Books shows the title of page 11, "New Alan Watts Lectures".

Searching for the year numbers, the matches on that page seem to be for some text about the indexing of works in MMRI-1970, 1971 and 1972, rather than a publication year.

[-]

dredmorbius 9 months ago

Thanks for checking.

And for confirming that HT is even more useless than I'd thought.

Again: Fuck copyright.

Eisenstein 9 months ago

You might want to look for:

Watts, Alan. Myth and Religion : the Edited Transcripts. First edition. Boston: Charles E. Tuttle Co., 1996.

It contains "Jesus - His Religion, Or the Religion About Him", which appears to be a very slightly different title from the work that you are searching for.

[-]

dredmorbius 9 months ago

I'd found that text at the time as noted here: <https://news.ycombinator.com/item?id=41235652>.

The text includes the transcripts, but doesn't include the original date(s) of delivery / publication. And it's published a quarter century after the initial records of the lecture.

As noted, I'd emailed the Alan Watts institute but have received no reply.

bgoated01 9 months ago

I just put in a request at my university library for that item. I'll let you know what it turns up.

[-]

dredmorbius 9 months ago

Thank you so much!

[-]

bgoated01 8 months ago

For some reason the copy my library has doesn't have that page the same as it appears on hathitrust. Other nearby pages match the snippets I see on google books, but that page does not, and none of the nearby pages do, either. So I'm afraid my search here is a dead end.

consf 9 months ago

[dead]

yonran 9 months ago

> Dan Clancy, the Google engineering lead on the project who helped design the settlement, thinks that it was a particular brand of objector—not Google’s competitors but “sympathetic entities” you’d think would be in favor of it, like library enthusiasts, academic authors, and so on—that ultimately flipped the DOJ.

I was at Google in 2009 on a team adjacent to Dan Clancy when he was most excited about the Authors’ Guild negotiations to publish orphan works and create a portal to pay copyright holders who signed up, and I recall that one opponent that he was frustrated at was Brewster Kahle of the Internet Archive, who filed a jealous amicus brief (https://docs.justia.com/cases/federal/district-courts/new-yo...) complaining that the Authors’ Guild settlement would not grant him access to publishing orphan works too. In my opinion Kahle was wrong; the existence of one orphan works clearinghouse would have encouraged Congress to grant more libraries access instead of doing nothing which is what actually happened in the 15 year since then. Instead of one company selling out-of-print but in-copyright books, or multiple organizations, no one is allowed to sell them today.

Since then, of course, Brewster Kahle launched an e-library of copyrighted books without legal authorization anyway which will probably be the death of the current organization that runs the Internet Archive. Tragic all around.

[-]

chambers 9 months ago

I wish the contradiction you spotted was clear on their Wikipedia page. It demonstrates how far back IA's management troubles go, and how their clean image was maybe just an image.

For me, I became concerned when they fibbed about why the Internet Archive Credit Union was liquidated. IA alleged it was shut down due to onerous regulations, but the government said IA actually never lived up to their goal of allowing local, low-income folk to sign-up for their service. https://ncua.gov/newsroom/press-release/2016/internet-archiv...

vintermann 9 months ago

The situation with IA is one thing, but the general negative attitude of libraries and archives to public access is something I've observed too, and it's depressing.

For instance, they can spend a lot of effort digitizing an archive they got from a business active from 1890 to 1970 - and then put it all in a single collection, which the public won't get access to until 2070. There's no reason to think the business handled sensitive personal information, but it's too much work to check, so they assume it did. They could classify individual documents according to whether they were actually from before 1920, but that's too much work too.

[-]

acdha 9 months ago

One thing to keep in mind is that libraries and archives have been budget starved for decades. You can often get a grant to acquire or digitize something but fewer donors are interested in paying lawyers to evaluate copyright exposure, which leads to conservative policies.

All of the actual librarians and archivists I know hate this situation - it’s not a job you take if you don’t want people to access things – but that tends to translate into requests for copyright exemptions.

A really big one is orphan works where they have things like digitized music which can’t even be linked to a known copyright holder because it’s unclear who owns it after decades of contract shuffles and acquisitions, where you could potentially solve the problem by changing copyright law to require periodic payments to maintain protected status so someone at, say, Sony would have to cut a check every year to say that they still want to protect some obscure old blues track from 1952 which they don’t even offer for sale any more. I especially liked the proposals linking that to availability for mainstream sale: say that there’s no charge for anything which is normally available on iTunes, Play, Amazon, etc. but you need to pay a fee for works which aren’t available.

ghaff 9 months ago

That’s the case with so much of this sort of thing. You also see it in cases like open sourcing proprietary software. You need to pay someone competent to do a thorough audit or you end up with headlines about so and so releasing PII or otherwise confidential information.

[-]

notpushkin 9 months ago

I think it could be a great charity btw. Get donations from public, talk companies into releasing their old software, hire auditors for them and maybe developers to get the result running without all the proprietary third-party components.

[-]

ghaff 9 months ago

Personally I’d rather see maintainers getting better compensated for creating new and widely useful software in cases where they don’t have a corporate sponsor. Most abandoned proprietary software is just old.

[-]

notpushkin 9 months ago

I’d love that too, of course. But some old things are great for their historical value – personally, I would be thrilled to see Winamp released properly, for example.

mastazi 9 months ago

This is an insightful comment and I thank you for sharing it but, after having looked at the brief you linked

> a jealous amicus brief that the Authors’ Guild settlement would not grant him access to publishing orphan works too

that's not a fair overview of the amicus brief, there are good points there about the process of notifying orphan works rights holders and about the risk of a monopolistic position. I do agree with you on this part though

> the existence of one orphan works clearinghouse would have encouraged Congress to grant more libraries access instead of doing nothing

Edit: I also agree with you that the way the IA subsequently created its e-library was not ideal.

[-]

yonran 9 months ago

> that's not a fair overview of the amicus brief, there are good points there about the process of notifying orphan works rights holders and about the risk of a monopolistic position

What I meant by “jealous” is that the Internet Archive’s interest was not to improve author notification or to protect foreign authors; it was to provide a competing service under similar or better terms than Google was able to negotiate without spending the time and money that Google did litigating. Kahle wanted what was in Google’s settlement.

And what I meant by “Kahle was wrong” is not that every argument that his lawyers thought up was false; I think the agreement was later amended to fix some issues. My point is that Kahle’s theory of change was wrong. He thought that when the settlement was rejected, then Google would push Congress to create an orphan works law which the Internet Archive could use to publish old books too. As he wrote in his op-ed, “We need to focus on legislation to address works that are caught in copyright limbo. … We are very close to having universal access to all knowledge. Let's not stumble now.” https://www.washingtonpost.com/wp-dyn/content/article/2009/0... As it turns out, the rejection of the class action settlement did not cause Congress to create an orphan works law. In retrospect, we would have been more likely to get an orphan works law if Google had been allowed to set up a proof of the concept, making the monopoly on orphan works temporary.

[-]

cxr 9 months ago

There's such a weird tone to your posts. It's as if they're meant to give the impression that Kahle had a substantial, if not singlehanded, influence over the outcome. In reality, his input probably didn't have even the impact that Kahle himself hoped for and the appropriate adjective to describe the effect is probably "negligible", if at all. It was a class action lawsuit with extremely dubious underpinnings where over 6,000 people wrote in to ask that they not be considered part of the class.

pastage 9 months ago

I think the biggest hurdle was lawyers from the big publishing houses, the only view I heard was "This deal will make Google to powerful", the deal would have had far reaching international effects. In the end we know that Creative Commons was wrong you can not fix copyright by playing along.

lokar 9 months ago

I would say it’s much worse then “not ideal”, they may have poisoned the well for decades to come.

[-]

adastra22 9 months ago

Maybe permanently, as societal stances on these sorts of issues tend to solidify over time. In a couple of generations the very idea of a library may be confined to history thanks to IA :(

[-]

sulam 9 months ago

That is far-fetched, science fabulist style thinking. The average person on the street does not know about Brewster or the IA, but they certainly do know about their local library and would object if it was to disappear.

[-]

immibis 9 months ago

I expect the average person has never visited a library since school

[-]

rdmond 9 months ago

In fact, in the US, around half the population goes to a library at least once a year

https://www.pewresearch.org/internet/2016/09/09/libraries-20... https://www.ala.org/news/2019/12/new-ala-report-gen-z-millen...

[-]

greentxt 9 months ago

"with a majority reporting that libraries ... play at least some role in helping them decide what information they can trust."

Sounds legit

tehjoker 9 months ago

why are you blaming IA for the aggressive and tight fisted behavior of publishers?

shkkmo 9 months ago

> In my opinion Kahle was wrong; the existence of one orphan works clearinghouse would have encouraged Congress to grant more libraries access instead of doing nothing

Maybe. I think that is a pretty optimistic view of congress and our political process. I would argue that having a powerful, rich company with a monopoly to lose would have made passing such a law less likely, not more.

I do think we would have been better off with a Google monopoly on unpublished unclaimed books than with the lack of access we have today.

The article says:

> You’d get in a lot of trouble, they said, but all you’d have to do, more or less, is write a single database query. You’d flip some access control bits from off to on. It might take a few minutes for the command to propagate.

If it's so easy, I'm suprised nobody has done it and accepted the consequences. It seems one of the largest single positive impacts any person could make on the world. Once it's released, it'll never go back in the box. A modern Pandora.

[-]

sulam 9 months ago

In practice that is an obvious exaggeration for the purposes of making a point. It is probably simple enough to make the change, and it’s equally easy to change it back. One configuration makes you subject to massive lawsuits and the other doesn’t.

[-]

xp84 9 months ago

I suspect GP means exfiltrating the archive (hence the suggestion that it will never go back in the box) rather than setting them fully readable on GBS itself.

And I agree, someone should do that, because keeping books locked up that the rightsholders can’t be bothered to even attempt to make money from is asinine. Too bad our government doesn’t agree with me, so we’ll just have to wait for it to accidentally walk out the door.

kmeisthax 9 months ago

How would a settlement with the Authors' Guild cover orphan works? If the Authors' Guild is in a position to grant a license, then it's not an orphan work. The whole orphan works problem is that for a lot of valueless works, nobody knows who owns what.

[-]

xp84 9 months ago

Class actions can include people just by describing who makes up the class, that’s why the parties who worked on the proposed settlement came up with the plan. This method (uniquely) allowed settling with (essentially negotiating a deal with) all those unknown parties, a feat impossible otherwise.

The class as I understand it was the copyright holders of every book in a library, and if approved by the court they can all be legally said to have agreed to the settlement’s terms if they didn’t opt out. Now as with anything legal the whole thing depends on someone’s interpretation of whether it’s ok, but this was a plausible reading of Rule 23 of the Federal Rules of Civil Procedure.

One could file a class action suit on behalf of everyone who ordered a strawberry shake at a McDonald’s in 2023, without ever knowing who or exactly how many they are, and if a judge certified it as a class action and McDonalds cut a deal with those representing the class, the terms would bind them all (except those who explicitly excluded themselves).

jamiek88 9 months ago

That pandemic library was a huge, obvious over step by him.

It will have consequences far beyond the immediate lawsuit too.

The very concept has basically been iced for a generation and the net is only getting more locked down not less.

[-]

jmb99 9 months ago

Fortunately (by some definition of fortunately), most countries don’t agree on exactly how the web should be “locked down.” This benefits at least some people (like me) who live in countries who make also no effort to restrict what can be shoved down the internet tube, including from countries that don’t particularly care about western copyright law. Would it be nice to have a fully sanctioned pandemic library-style service? Absolutely. But I have never once looked for a textbook, paper, regular book, etc online and not found a copy for free. Usually takes the same amount of time or less compared to finding a copy on Amazon (if it’s currently in print), and almost always less time using my library’s clunky online ebook platform[1].

Is that legal? Technically yes, in my country. Is it ethical? Debatable, depending on who you’re asking. But for me personally, I have found it to be getting substantially easier to find high quality copies of copyrighted anything in the past 3-5 years compared to 10-15 years ago, so I don’t necessarily agree with the blanket statement that “the net is only getting more locked down.”

[1] I like to use the library as much as possible, if for nothing else than to increase usage numbers to marginally positively decrease the likelihood of finding cuts.

[-]

notpushkin 9 months ago

> Would it be nice to have a fully sanctioned pandemic library-style service? Absolutely.

This is basically LibGen / Anna’s Archive. A bit clunky around the download process (maybe things get better if you get a paid subscription though!), but overall it works pretty well.

[-]

southernplaces7 9 months ago

>This is basically LibGen / Anna’s Archive. A bit clunky around the download process

Not at all. You visit libgen, search for your book, find it (usually available) click one of the two available links for it, click to download, have it download. Done.

It couldn't be easier.

pessimizer 9 months ago

Thanks for making me aware of this. This guy's heart is clearly (to me) in the right place, but his understanding of power is seriously lacking. That's probably what gave him the hubris to create Wayback and IA, but he'll be absolutely dumbstruck when they shut it down.

[-]

kragen 9 months ago

He won't be surprised at all. His slogan is "governments burn libraries". He's been able to forestall that for a while, and even provide public access, but permanence of the IA as an institution was never in the cards, given its subversive goal: universal access to all human knowledge.

Guess where the first backup copy of the Internet Archive is located.

[-]

Yeul 9 months ago

Libraries are funded by the government. They've been diligently scanning books for decades but nobody has had even the slightest interest in that until their favourite hackerman showed up. Libertarian god complex is so tiresome.

[-]

dredmorbius 9 months ago

Some libraries are government-funded. Many are not.

That ranges from the personal book collection numbering from one to many thousands in private hands, private institutional libraries (the Mechanics Institute in San Francisco is one that comes to mind, many private universities and grammar schools have their own, as do numerous corporations, some of which are catalogued by Worldcat).

Preservation of Western culture, notably the Greek and Roman canons, as well as much literature and knowledge of the Jewish, Byzantine, and Islamic worlds, occurred through religious institutions. Though in some regards those were the governments of the time. Indian, Chinese, and other further East Asian collections were preserved through multiple means.

Book digitisation at US academic institutions (the University of Michigan being a major contributor to both Google Books and HathiTrust) has had its own exctremely combative relationship with commercial publishers, as has the US Library of Congress, which issues US copyright in the first place.

Avoid slurs, it's an HN guideline: <https://news.ycombinator.com/newsguidelines.html>.

kragen 9 months ago

Most libraries are not funded by a government. Your comment is not up to acceptable quality or civility standards and should not have been posted.

inquirerGeneral 9 months ago

[dead]

the_af 9 months ago

The Wayback machine is such an invaluable tool.

I've used it to track down when wording on a site (for someday relevant to my job) changed, for example.

[-]

nanna 9 months ago

I imagine there would be enough institutional support for the Way Back Machine from the likes of Wikipedia at least that even if the IA did go down the WBM would be carved off and kept alive. Effectively I think the IA will be broken up.

steeeeeve 9 months ago

There was a lot of public debate about this at the time. Kahle's argument made sense.

breck 9 months ago

[flagged]

caseysoftware 9 months ago

I worked at the Library of Congress on their Digital Preservation Project, circa 2001-2003. The stated goal was to "digitize all of the Library's collections" and while most people think of books, I was in the Motion Picture Broadcast and Recorded Sound Division.

In our collection were Thomas Edison's first motion pictures, wire spool recordings from reporters at D-Day, and LPs of some of the greatest musicians of all time. And that was just our Division. Others - like American Heritage - had photos from the US Civil War and more.

Anyway, while the Rights information is one big, ugly tangled web, the other side is the hardware to read the formats. Much of the media is fragile and/or dangerous to use so you have to be exceptionally careful. Then you have to document all the settings you used because imagine that three months from now, you learn some filter you used was wrong or the hardware was misconfigured.. you need to go back and understand what was affected how.

Cool space. I wish I'd worked there longer.

[-]

caseysoftware 9 months ago

Also.. it was fun learning the answer to "what is the work?"

If you have an LP or wire spool recording, the audio is the key, obvious work. But then you have the album cover, the spool case, and the physical condition of the media. Being able to see an album cover or read a reporter's notes/labeling is almost as important as the audio.

[-]

9 months ago

[deleted]

ForHackernews 9 months ago

Is the Library of Congress really beholden to copyright laws? I guess I assumed as the national deposit library they had a special exemption to copy any damn thing they pleased for archival purposes.

If they don't have that prerogative, they probably should, and Congress should legislate that to be the case.

[-]

aspenmayer 9 months ago

The Library of Congress and its staff determine fair use exceptions in certain contexts so I’m not sure who could find fault with them, as they could simply authorize it before or after the fact, from what I understand.

ErikAugust 9 months ago

“Page had always wanted to digitize books. Way back in 1996, the student project that eventually became Google—a “crawler” that would ingest documents and rank them for relevance against a user’s query—was actually conceived as part of an effort “to develop the enabling technologies for a single, integrated and universal digital library.” The idea was that in the future, once all books were digitized, you’d be able to map the citations among them, see which books got cited the most, and use that data to give better search results to library patrons. But books still lived mostly on paper. Page and his research partner, Sergey Brin, developed their popularity-contest-by-citation idea using pages from the World Wide Web.“

Larry Page had some cool ideas… can’t imagine Books will ever be resurrected, unfortunately.

[-]

dekhn 9 months ago

He really wanted to digitize all of them to provide reference and training data for early language models (well before LLMs, transformers, etc).

He also had a plan (with George Church) to build enormous warehouses holding large-scale biology research infrastructure right next to google data centers. Because most biology research is done at locations that have reached their limit on computational/storage capacity.

Larry had many good ideas but he struggled to get the majority of them off the ground. For example, when Trump was president and invited all the major tech leaders, Larry came with a plan to upgrade the US electrical system with long-range DC.

[-]

shiroiushi 9 months ago

>Larry came with a plan to upgrade the US electrical system with long-range DC.

I feel like some crucial detail is missing here. They already use HVDC for long-distance transmission lines, inside and outside of the US. Texas could benefit from it I suppose, but the US in general already uses it where appropriate AFAIK.

pyrale 9 months ago

> Larry had many good ideas but he struggled to get the majority of them off the ground. For example, when Trump was president and invited all the major tech leaders, Larry came with a plan to upgrade the US electrical system with long-range DC.

I fail to see how that would be a good idea.

To me, it looks like some magnate in a completely unrelated industry, who is megalomaniac enough to believe that they can enter a completely unrelated industry and explain to experts how things ought to get done.

[-]

dekhn 9 months ago

His father was a computer scientist and his brother Carl is an electrical engineer; Carl has done extensive research in the area (and runs at least one company working on related problems) and Larry is highly educated about this stuff. He's not a megalomaniac- just completely unrealistic in how he pursues his goals.

gosub100 9 months ago

He didn't originate the idea, it's well known that DC is more efficient but has other serious hurdles. It often takes innovative and influential people to break the status quo. E.g. Elon and EVs.

[-]

tmtvl 9 months ago

> E.g. Elon and EVs.

Man, Nissan just gets no love at all. They did the EV thing before Elon ran X into the ground (the bank, not the website formerly known as Twitter): https://www.motortrend.com/features/nissan-leaf-ev-history-p...

[-]

gosub100 9 months ago

The claim wasn't "Elon made the first-ever, mass-produced, street-legal 0 emissions vehicle powered by electricity.

xp84 9 months ago

I have no dog in the EV race, but the Leaf never had an important impact in the USA at least. They were incredibly impractical due to their short range. I also never heard anyone say they were fast either, so I have to credit Musk with making the first BEV that was both widely useful and fun to drive, which were two things we didn’t believe EVs were back then.

pyrale 9 months ago

> it's well known that DC is more efficient but has other serious hurdles.

If it's well known but no one has done it, maybe there are reasons for it? Building long-distance lines is a very capital-intensive decision, and if the cost-benefit analysis was better, projects would be done. Looking by DOE's communication [1], we can see that the cost-benefit analysis doesn't look good now for a big network, but also that HVDC projects that actually make economic sense have been built for decades.

Sure, sometimes innovative people make a big change, but what's the innovation there?

[1]: https://www.energy.gov/oe/articles/connecting-country-hvdc

[-]

gosub100 9 months ago

There absolutely are HVDC transmission lines under the sea. And capital intensive, revolutionary projects are ripe for disruption from people who have already pulled off similar feats, hence why it would make sense for a Google founder to take interest.

[-]

pyrale 9 months ago

> There absolutely are HVDC transmission lines under the sea.

Not sure where you see me claiming the opposite?

> And capital intensive, revolutionary projects are ripe for disruption from people who have already pulled off similar feats

If Larry Page is looking to waste a few billions in cables, he's absolutely welcome to do it. One HVDC connection between France and the UK is private, for instance [1]. If he believes there's money to be made similarly in the US, he should go for it.

That's not exactly the same offer as "Larry page has suggested the government should invest in expensive infrastructure for an industry he's never worked in".

[1] https://en.wikipedia.org/wiki/ElecLink

[-]

gosub100 9 months ago

> Not sure where you see me claiming the opposite

Hmm, maybe the part where you said "but no one has done it"?

What is the point of this level of degeneracy in conversation?

carlosjobim 9 months ago

> The idea was that in the future, once all books were digitized, you’d be able to map the citations among them, see which books got cited the most, and use that data to give better search results to library patrons.

You can do something similar to this already, by mapping which books are cited in Wikipedia articles. If you know how to do such a thing, because I don't.

[-]

aspenmayer 9 months ago

Not specific to Wikipedia:

https://aarontay.medium.com/3-new-tools-to-try-for-literatur...

https://archive.is/Ul13s

Specific to Wikipedia:

Wikipedia Citations: Reproducible Citation Extraction from Multilingual Wikipedia [2024]

https://arxiv.org/abs/2406.19291v1

https://doi.org/10.48550/arXiv.2406.19291

> Wikipedia is an essential component of the open science ecosystem, yet it is poorly integrated with academic open science initiatives. Wikipedia Citations is a project that focuses on extracting and releasing comprehensive datasets of citations from Wikipedia. A total of 29.3 million citations were extracted from English Wikipedia in May 2020. Following this one-off research project, we designed a reproducible pipeline that can process any given Wikipedia dump in the cloud-based settings. To demonstrate its usability, we extracted 40.6 million citations in February 2023 and 44.7 million citations in February 2024. Furthermore, we equipped the pipeline with an adapted Wikipedia citation template translation module to process multilingual Wikipedia articles in 15 European languages so that they are parsed and mapped into a generic structured citation template. This paper presents our open-source software pipeline to retrieve, classify, and disambiguate citations on demand from a given Wikipedia dump.

Prior work referenced in above abstract with some team overlap:

Wikipedia citations: A comprehensive data set of citations with identifiers extracted from English Wikipedia [2021]

https://direct.mit.edu/qss/article/2/1/1/97565/Wikipedia-cit...

https://doi.org/10.1162/qss_a_00105

Datasets:

A Comprehensive Dataset of Classified Citations with Identifiers from English Wikipedia (2024)

https://zenodo.org/records/10782978

https://doi.org/10.5281/zenodo.10782978

A Comprehensive Dataset of Classified Citations with Identifiers from Multilingual Wikipedia (2024)

https://zenodo.org/records/11210434

https://doi.org/10.5281/zenodo.11210434

Code (MIT License):

https://github.com/albatros13/wikicite

https://github.com/albatros13/wikicite/tree/multilang

Bonus links:

https://www.mediawiki.org/wiki/Alternative_parsers

https://scholarlykitchen.sspnet.org/2022/11/01/guest-post-wi...

lqstuart 9 months ago

…and then they sold out to a Wall Street dickhead, and here we are

Zigurd 9 months ago

O'Reilly, for whom I've been a lead author and co-author, did this: https://www.oreilly.com/pub/pr/1042

They call it Founder's Copyright. The also use Creative Commons. The goal is to make out of print books available at no cost.

[-]

card_zero 9 months ago

> A complete list of available titles is at www.oreilly.com/openbook

Exciting!

Follows link

Link no longer exists, gets O'Reilly front page instead

"Introducing the AI Academy, Help your entire org put GenAI to work"

Thanks O'Reilly.

[-]

ToucanLoucan 9 months ago

The original dream of the internet: Information, freely available to any who want it.

The new dream of the internet: Some information, that aligns with the values of our advertisers, delivered via an LLM that sometimes makes shit up.

[-]

southernplaces7 9 months ago

Whose dream? It's specifically here on HN that I find the largest number of comments bitching about the uselessness of the internet and how they replaced their own search efforts with just asking a hallucinating LLM, as if they had no choice in the matter. Great way to help make things better friends....

If anything the internet today is more loaded than ever with cool information and useful stuff, especially as ever larger bodies of formerly analog content get digitized and often with full open access. If one can get over their myopic naval gazing and cultivation of fetishism about everything having gone to shit, it's not even hard to find most of that useful information.

The internet -like any complex thing with multiple interests involved in its existence and operation- is just whatever works best for different people in different contexts, commercially, personally, technically and so forth. It's neither an ideal that one should obsess over or something to be neatly pigeonholed into a box of how it "should be". Adapt, use its tools to make whatever parts of it you can fit whatever your personal ideal is, instead of endlessly blaming advertisers or people just trying to make a living from one more commercial landscape.

seanp2k2 9 months ago

Yeah, each year we inch closer to an internet where the only things to do revolve around buying things; watch “content” which mostly revolves around creators shilling products, research products, or buy products. Every hobby has to be monetized now, everything has to be a side hustle, every impression monetized. Few seem to bother anymore with personal blogs that exist for their own enjoyment and sharing of knowledge, and yet with all this paid creation, full-time artists struggle more than ever, largely unable to afford living costs in the very cities they helped to build the culture and value of.

I find it personally difficult to look at the entirety of the internet in 2024 and say that it’s definitely better for society than it was in 2004. I guess now at least we can mostly book appointments on our phones without having to speak with someone in real-time as they read dates and times off of a calendar interface that we can now just use ourselves directly.

[-]

southernplaces7 9 months ago

There's nothing wrong with people trying to make money on the internet. You can sit in your bubble of personal dislikes and preferences, whining about average people using a very accessible tool to try to make a living for themselves, just like you presumably do in some way or another, but why not instead see the bigger picture of an internet in which not all things are shit and not all commercialization is automatically bad.

Personal blogs, creative efforts and wonderful resources still abound on the internet and can still usually be found quite easily if you put a bit of effort into looking.

hotspot_one 9 months ago

and where the information you are looking for is plastered with ads.

yes I know adblocker, pihole, etc.

stvltvs 9 months ago

Looks like Openbook stuff is still there, just homeless. I had to do a web search to find it. For example:

https://www.oreilly.com/openbook/make3/book/

[-]

blacksmith_tb 9 months ago

Yes, I see it all with

https://www.google.com/search?q=site%3Aoreilly.com+inurl%3Ao...

So it seems like it mainly lost the overview page?

[-]

tourmalinetaco 9 months ago

It looks as though they killed the page sometime between June 7th and June 26th, although the page on June 7th seem to try to redirect to “https://oreilly.janrainsso.com/static/server.html?origin=htt...

https://web.archive.org/web/20240607220047/http://www.oreill...

Definitely perplexing, I can’t find the reason to kill what appears like a simple HTML page unless they’ve killed the project entirely.

tap-snap-or-nap 9 months ago

Basically of no good now

MollyRealized 9 months ago

It's okay, I'll just check the Wayb--shit

[-]

tourmalinetaco 9 months ago

Wayback Machine has been working for the past few days, look: https://web.archive.org/web/20240607220047/http://www.oreill...

[-]

Zigurd 9 months ago

It looks like the pdfs of at least some of the books are there. But it's a pretty small subset of their out of print books.

9 months ago

[deleted]

microtherion 9 months ago

It's somewhat ironic that, while the individual books are still accessible, their index pages https://www.oreilly.com/free and https://www.oreilly.com/openbook both redirect to some AI propaganda these days, with no links to the books left.

A third party page still has links to some (possibly all) of the books: https://zapier.com/blog/free-oreilly-press-books/

Zigurd 9 months ago

I could not find a listing of available downloads. Sent an email inquiring about this. Will report if I get anything enlightening back.

svilen_dobrev 9 months ago

This seems to be the fate of knowledge/content that stays in institutions which have been built with the idea of collecting it and growing it.. but have turned into walled gardens/crypts of sort. Rot/Rust and be forgotten.

A very cynical and dark view is that the New things/people need that oblivion in order to feel great, for not haveing to compare with old great-er ones. Rewriting history as it seems fit the current powers-that-be, is easier this way.

Or may be it's just collective stupidity? or societal immaturity ?

(i am coming from completely different killed project on a different continent, but the idea is the same)

[-]

gosub100 9 months ago

The books project was very early in Google's history. Possibly before their IPO. Since then, they've shed their don't be evil motto and shitcanned the 20% time affordance for new projects.

I think it's neatly summarized in two words: shareholder growth.

kyleee 9 months ago

I think you are on to something, people frequently don’t want to grapple with and understand what has been done before, they prefer to just wing it and move forward on their own.

shiroiushi 9 months ago

I seriously doubt there's very much highly relevant old knowledge locked away somewhere. Is there interesting stuff we don't have good access to? Sure, but mainly of interest to historians (pro or amateur). You're not likely to find the cure for cancer written down in some 1000-year-old book somewhere. And while a few people might really be interested in reading decades or centuries-old novels that weren't popular enough to be called "classics" now, the vast majority of people aren't going to find such stories about people in the distant past all that interesting.

Of course, it's best to preserve past knowledge, but I think the idea that this is part of some kind of conspiracy to keep people buying new stuff is pretty silly. People are always going to want new stuff, as society grows and changes.

[-]

sersi 9 months ago

> find the cure for cancer written down in some 1000-year-old book somewhere.

While you're most likely right about the cure for cancer, I did want to note that this is kind of how the cure for malaria was found (artemisinin). Tu Youyou who won the nobel prize systematically investigated Traditional Chinese Medecine remedies until she came across one that was effective. That particular remedy was described in a 1600 years old text The Handbook of Prescriptions for Emergency Treatments, written in 340 by Ge Hong.

Note: before anyone think that the fact that a remedy described in traditional Chinese texts mean that TCM is reliable and a viable alternative in all text, she screened over 2,000 traditional Chinese recipes and made 380 herbal extracts, from some 200 herbs which were tested on mice. So, yes one of the remedy were successful but the success rate of TCM was not particularly high :)

[-]

shiroiushi 9 months ago

This is really interesting actually, but I do want to point out that I'm not saying all old knowledge is only of interest to historians: I was arguing against the idea that some old knowledge is "locked away" and not publicly available as part of some kind of conspiracy.

Lots of old knowledge is readily available to the public. The complete works of Shakespeare are a good example here, as is Homer's epics The Odyssey and The Iliad. (I don't know if that TCM stuff is or not.) These kinds of things are considered "classics" and are frequently reproduced, now available online in countless places, etc. Obviously, lots of people besides historians think they're important, and so they're copied frequently and made available at large. As the famous rule goes, "99% of everything is crap", so probably all the best stuff from the past is well-preserved, and the rest, not so much. I seriously doubt that something as good as Shakespeare is locked away in someone's private library and virtually unknown to almost anyone.

Of course, there's always exceptions and you never know when some overlooked tidbit of info from the distant past might be really useful, as you showed here, so I do think it's important to preserve and enable easy access to as many old works as possible.

hotspot_one 9 months ago

> I seriously doubt there's very much highly relevant old knowledge locked away somewhere.

Interesting take on what "knowledge" means and what makes knowledge valuable.

If I understand "knowledge" as "information directly relevant to a technical problem", then:

- the knowledge which remains relevant to that problem will stay available to practitioners (i.e. the properties of a Gaussian distribution, from Gauss, 1809)

- the knowledge which is no longer relevant to that problem will probably be lost (how to compute the integral of a Gaussian using a slide rule. Slide rules first developed circa 1620, last used circa 1970)

In other words, yes, your point is profoundly true. Knowledge relevant to a specific task stays available, not relevant gets pruned quickly.

My question would be if we want to use that definition of relevant and that understanding of what drives value. i.e. I'm not asking if you are correct, I've just shown that you are correct. My question is if the assumptions/values which make this correct are assumptions/values we are comfortable with. In other words, is is wise?

[-]

shiroiushi 9 months ago

It's not just technical knowledge either: look at Shakespeare's works. They're centuries old, but there's absolutely no danger of those disappearing. Lots of old stuff is well-preserved and highly duplicated for easy access.

lanstin 9 months ago

My video streaming data supports this idea even tho I rewatch all Star Trek series (except TOS).

SapporoChris 9 months ago

I am fairly certain there is more knowledge/content available to anyone in this century than last century or any century before it. But perhaps I have misread your comment.

submeta 9 months ago

With library genesis, who needs Google Books anymore? I buy books physically to support the author/s and download an epub version from said site to my kindle. The physical books I hardly read, they are for my shelf. Although I love the feeling of printed books, but I read in bed, and it‘s easier to hold an ebook. Also I read when I commute. It’s lighter to have my Kindle Oasis with me with tons of books on it.

[-]

kccqzy 9 months ago

Someone needs to scan the book and upload it to library genesis. The article said Google had developed this massively efficient apparatus for scanning (or taking photographs of) books, and most of the article was about out-of-print books.

I personally have actually tried to contribute to libgen a particular difficult-to-find-online book by buying it, scanning it, and uploading it. There need to be more people doing this.

[-]

SauntSolaire 9 months ago

Did you use a scanning service or do it yourself?

[-]

kccqzy 9 months ago

I did it myself. A few hours for 500 pages. Surprisingly soothing.

ghaff 9 months ago

There’s the everything available online for free mindset. But, yes, I’ve basically donated all my books that were in the public domain. And, in general, have been massively purging my book collection of stuff I won’t realistically read again.

[-]

submeta 9 months ago

I do buy books, to support the authors. And I would encourage anyone to support the authors they like to read.

[-]

ASalazarMX 9 months ago

I agree, but also wouldn't lose sleep for pirating a book of an author that died more than 20 years ago, in most contexts.

layer8 9 months ago

Many books aren’t on libgen. It’s been rather hit and miss for me.

[-]

consf 9 months ago

[dead]

hotspot_one 9 months ago

how sure are you that library genesis will remain available?

[-]

submeta 9 months ago

Well, we got a large number of mirrors. Just like scientific hub. But to be honest: We cannot be sure. That’s why I have my physical books as well ;)

consf 9 months ago

[dead]

thayne 9 months ago

IMO if a work is out of print (or equivalent depending on the medium) for more than a few years, it should be released into the public domain. Or maybe something like the public domain, but requires attribution.

[-]

kps 9 months ago

Like trademark: Use it or lose it.

(The reality is that publishers would put lazy photocopies up for sale at ten zillion dollars a piece.)

eschneider 9 months ago

Have you dealt with publishers? If a work is out of print for a few years, much better to have rights revert to the creator.

[-]

WillAdams 9 months ago

Even that doesn't always work --- I was rebuffed by Joan Turville-Petre's son when I asked for a license to reprint his mother's notes on J.R.R. Tolkien's translation of _The Old English Exodus_ on the grounds that he would prefer to work with an academic, rather than an individual.

Anyone know an academic specializing in Old English who would like to oversee this reprinting? I have a typeset PDF which only wants proofreading and updating of the index.

[-]

aspenmayer 9 months ago

Public Domain Review?

giraffe_lady 9 months ago

Then every book will be immediately out of print after its initial run, while the not-quite-a-cartel of publishers all decline to print it until it hits the point where they no longer have to pay the author.

[-]

Jtsummers 9 months ago

Then the publisher loses out on exclusive publishing rights and also loses money. It's in their interests to keep it in print so long as it's a profitable book, even if they have to pay some percentage to the author. Once it goes into public domain every publisher can reprint it and the original publisher has to compete with them on price.

[-]

WillAdams 9 months ago

Part of the problem here is a change in tax law a couple of decades back where the backlist as a bunch of printed books in a warehouse became a tax liability when Congress tried to close a tax loophole which non-publishers were exploiting.

Rather than the old model of printing a reasonable print run, selling books as demand allowed, and keeping unsold books in warehouses only paying tax as they were sold, the new laws required paying tax on inventory each year --- so any books not sold in the first year were not as profitable, hence book remaindering, and the current mess.

jamiek88 9 months ago

> so long as it's a profitable book

And here is the rub. You’ll end up with three or four super authors with the rest being ripped off.

Much better for it to revert to the author in that situation IMO.

[-]

Jtsummers 9 months ago

I'm not arguing for it (or against it for that matter), I was just pointing out that the analysis in the comment I responded to didn't make sense. Every book won't be allowed to fall out of print and copyright just to exploit the authors because it would also hurt the publishers, they also benefit from exclusive publishing rights. Publishing rights are granted by the copyright holder (the author) to the publisher, much like patent licenses.

Regarding unprofitable books, they'll fall out of print anyways because they're unprofitable. Those authors won't be getting ripped off because they won't be making money either way beyond initial commissions and what few sales they get.

> Much better for it to revert to the author in that situation IMO.

The publisher doesn't hold the copyright, the author does, so copyright (the particular right under discussion) can't revert to the author as it never left the author. What the publisher holds is publishing rights per a contract with the author. That could revert back to the author (or be voided or however it's structured), and that would be reasonable but we don't need any laws for it, that would fall under normal contract terms. Whether it's a common thing now or feasible for a particular author (with no clout? maybe not, with billions in sales from prior books? probably) is another matter.

9 months ago

[deleted]

giraffe_lady 9 months ago

9 months ago

[deleted]

tap-snap-or-nap 9 months ago

Do we really need publishers anymore?

[-]

bloak 9 months ago

They are useful for quality control when the author is not well known.

[-]

ghaff 9 months ago

As someone who has both independently published and gone through a technical publisher, there’s still a stamp of approval, fully deserved or not, associated with your book being published by a known name.

pfdietz 9 months ago

So, e-books are either immediately out of print, or never out of print?

[-]

thayne 9 months ago

By "in print" I mean, the publisher is actively selling it.

Although, if I were writing the law I would require selling DRM free ebooks for ebooks to count for maintaining the copyright.

[-]

pfdietz 8 months ago

So, if an e-book is listed on Amazon, does that count as "actively selling it"? It's for sale, after all. The same happens with Print On Demand (POD).

This is actually a real issue when authors have deals where rights revert to them when the book goes out of print.

https://katemckean.substack.com/p/when-does-a-book-go-out-of...

tightbookkeeper 9 months ago

What if we applied the simple test that the book was originally published on paper and no other printings have occurred (digital or paper).

pessimizer 9 months ago

Never out of print. If there's an e-copy available to buy, that's better than millions of other books.

xipho 9 months ago

A huge proportion of this corpus is found in the Hathi Trust (see https://www.hathitrust.org/the-collection/). We have had a grant to crawl and derive an index on it via their supercomputing resources. I'm sure they are looking to LLM proposals, though they are exceedingly careful about the copyright issues.

https://www.hathitrust.org/

[-]

jsemrau 9 months ago

>I'm sure they are looking to LLM proposals

Well, it is a use case for this challenge https://www.kaggle.com/competitions/gemini-long-context

fredgrott 9 months ago

thank you as some of us were looking for something to replace the archive.org digital book library part....

boramalper 9 months ago

Of course someone needs to scan/digitise those books but for those which already are, there is Anna’s Archive.

https://en.wikipedia.org/wiki/Anna%27s_Archive

[-]

fx1994 9 months ago

it's a shame you have to pirate your way to find a book that is practically unavailable, but I support pirating old unavailable stuff

theendisney4 9 months ago

Programmers not law makers really control what goes and doesnt online.

Bittorent and ipfs etc are nice but things would be better if there was a large static archive with desktop clients exchanging chunks in a complex modular way.

Say: I have pages 1-15 of file 123456, you have page 16 but are looking for page 1 of doc 2345, if i can obtain that page a fast exchange is possible. If not a different module can issue an iou that either means i owe something, you are owed something or both. Other modules could create groups that aim to store part of the archive without duplication amoung members. Spam driven modules could also be interesting.

The archive can be organized by how dubious the copyright is so that one can limit participation to 50 or 100+ year old publications and/or living or dead authors.

Its not unlike living on a far away island with the british empire seeking to control every aspect of your life without sufficient means of force.

carlosjobim 9 months ago

For Kagi users, I recommend putting books.google.com as a pinned domain. This way, you'll many times be presented with some of the best sources for any search query. Then it's a matter of finding the ePub file of that book. To read on MacOS, FBReader is a high quality app.

[-]

emmelaich 9 months ago

Thanks. Looks like it's available for Windows/Linux too. At last as of FBReader 2.1.2 30th September 2024.

Animats 9 months ago

We need a Copyright Term Reduction Act.

It's time. 50 years, renewal is possible but expensive.

[-]

mjevans 9 months ago

Just my opinion but as a starting point for the argument...

  * 20 years from date of first publish (renewable up to CAP? 50 years)
  * Must remain available every year
  * 10 year renewal blocks with massive registration fee increases
  * Compulsory maximum license fee cap (can offer for less) in the laws

Note this is not TRADE MARK; trade marks are _consumer protection_ related to 'brand ownership'.

js8 9 months ago

Google Books is a tragedy of the commons problem, created by copyright, which is supposedly a solution to the tragedy of the commons problem.

ASalazarMX 9 months ago

Even 50 is a lot, because it starts at the death of the author. Popular culture shouldn't remain locked out for generations. 50 maximum would be ideal, two generations from the one who experienced it in the original cultural context.

[-]

Animats 9 months ago

50 years from first publication. That's all the TRIPS agreement requires.[1]

[1] https://en.wikipedia.org/wiki/TRIPS_Agreement

[-]

ASalazarMX 9 months ago

And that's still a lot, since fifty years from publication is the minimum to abide to TRIPS, but we're used to much worse, so it doesn't sound as bad now. It could be shorter, things move a lot faster nowadays, a single generation of monopoly means more today than a hundred years ago.

gosub100 9 months ago

And reign in the damages for infringement to some amount closer to what was actually lost. For instance, if someone has a million books on a drive they haven't deprived the publisher of a million sales for chrissakes

rekabis 9 months ago

Let’s rewrite copyright law:

1. The author gets to say, “I produced this”, and to control if it gets published.

2. Exclusive copyright for 15 year terms.

3. Renewal possible if author still alive. Non-human rights holders (corporations, etc.) limited to 30 years total (one renewal) from date of first publication, regardless of item ownership. Failure to renew automatically opens up the product.

4. Existing copyright can be overridden if demand isn’t being adequately serviced (sliding scale, challenger must capture minimum % of existing market demand to prove). Pricing of overriding attempts must be reasonable, only cost of production can be directly paid for, everything else goes into an escrow account until the attempt is concluded. This is where anti-abuse rules for both sides are most extensive.

Information and knowledge must be free. Our civilization depends vitally upon that freedom.

senkora 9 months ago

I’m sure the lawyers will eventually figure out a way to train an LLM on them.

[-]

datadrivenangel 9 months ago

They probably already have! It seems like an amazing training dataset even if you can't share source data.

[-]

amelius 9 months ago

How do you train an LLM such that it is guaranteed to never regurgitate its training data?

[-]

ASalazarMX 9 months ago

You punish it if parts of the answer can be found in its training data, and reward it otherwise.

[-]

amelius 9 months ago

But the whole point of the training is that you reward it if it correctly reproduces the next token.

[-]

zeroxfe 9 months ago

That's not the whole point of the training. It's just (very loosely) a measure of loss used during pre-training. There are many post-training and alignment stages in a typical model that are designed to reward high-quality responses.

Technically, yes, it's impossible to guarantee that it won't just regurgitate source material (which is mostly around the tails of the data distribution), but the whole point of training is to build generalized intelligence.

[-]

amelius 9 months ago

I guess I used the wrong wording but it doesn't change the argument. Yes, the whole point of training is to build generalized intelligence (or at least that's what we __hope__ for). But as far as I understand, we do it __mainly__ by training for the next word in the sequence.

PS: you speak of "pre-training" and "post-training", so I'm curious what you think is the main part of the training (?)

einpoklum 9 months ago

Written from a capitalist perspective, extolling "market forces" and legitimizing corporate and government limitations on copying.

"between 1923 and 1963 ... copyrights back then had to be renewed, and often the rightsholder wouldn’t bother filing the paperwork" - oh no, how terrible. How lucky we are that in these modern times one doesn't even have to file paperwork in order to prevent you from copying information.

and they go on to suck to Google and decry how they didn't get to legitimize their control over a large swath of human knowledge and cultural heritage.

"It certainly seems unlikely that someone is going to spend political capital—especially today—trying to change the licensing regime for books, let alone old ones." <- copyright regime, licensing regime - all of this stuff is illegitimate apriori. Poetry, literature, music, software, papers and books - we cannot and must not tolerate restrictions on their dissemination.

What arrangements the commercial and governmental entities come to, our "arrangement" should be that everything gets disseminated widely and without restriction, so that curtailment, censorship, commercial control etc. just fail.

shadytrees 9 months ago

James Somers writes beautifully; https://www.newyorker.com/contributors/james-somers has some of his other writing

mcepl 9 months ago

> Copyright terms have been radically extended in this country largely to keep pace with Europe, where the standard has long been that copyrights last for the life of the author plus 50 years. But the European idea, “It’s based on natural law as opposed to positive law,” Lateef Mtima, a copyright scholar at Howard University Law School, said. “Their whole thought process is coming out of France and Hugo and those guys that like, you know, ‘My work is my enfant,’” he said, “and the state has absolutely no right to do anything with it—kind of a Lockean point of view.” As the world has flattened, copyright laws have converged, lest one country be at a disadvantage by freeing its intellectual products for exploitation by the others. And so the American idea of using copyright primarily as a vehicle, per the constitution, “to promote the Progress of Science and useful Arts,” not to protect authors, has eroded to the point where today we’ve locked up nearly every book published after 1923.

This is disingenuous: the article doesn’t mention that the biggest proponent of the prolonging of the copyright terms were Americans (e.g., Walt Disney Corp and Jack Valenti, see “Mickey Mouse Protection Act” for more) not Europeans.

2OEH8eoCRo0 9 months ago

The tragedy is that Google is tasked with this at all. It would be cool if public libraries could work together on a massive public digital library. This shouldn't be Google's responsibility.

[-]

Jtsummers 9 months ago

Google wasn't tasked (by a third party) with this, they chose to do it.

[-]

ants_everywhere 9 months ago

arguably Google was invented to fund this project.

The books project predates the search engine and the search engine grew out of the project of creating a universal digital library. The PageRank algorithm is one of a class of algorithms used to score citations in books and papers.

dredmorbius 9 months ago

HathiTrust was ... nearly this.

Until it too was emasculated.

<https://en.wikipedia.org/wiki/HathiTrust>

Otherwise, we have Project Gutenberg (public domain), OpenLibrary (Internet Archive, both PD and copyrighted works), ZLibrary, Library Genesis, and Anna's Archive.

NoMoreNicksLeft 9 months ago

All humans everywhere have a responsibility to preserve culture and knowledge to the best of their ability. I think what you meant to say is that none of us can trust Google with this important task.

[-]

renewiltord 9 months ago

One of the great tragedies of civilization is that we leave things in the hands of those who do them rather than in the hands of those who tell us about our responsibility to do them.

[-]

NoMoreNicksLeft 9 months ago

I personally do what I can. I've been trying to find old phone books and catalogs at garage sales, scanning them when I can get them. I teach my children that this is a responsibility of theirs.

But if you're offering me even a fraction of Google's budget, I think I might manage to scale things up.

[-]

renewiltord 9 months ago

Perhaps a second tragedy is that we give money to those who provide us with something. A better world might be where we give Google’s money to people so that they can teach children to buy phone books at garage sales. In this way, civilization may prosper.

[-]

Apocryphon 9 months ago

A simple search for "phone books" on Google Books yields no actual phone books, so the poster is objectively doing a better job than Google on that front.

mparnisari 9 months ago

Would it not be a viable solution to let Google scan and sell books, but force them to give the profit from the sales to the government?

DrNosferatu 9 months ago

I never seen an explicit mention if the Google Books corpus was indeed or not used for training LLMs…

Anyone knows more about it?

kbbgl87 9 months ago

> “Somewhere at Google there is a database containing 25 million books and nobody is allowed to read them.”

Greeted with a paywall on the source. Hypocracy...

tempfile 9 months ago

> what happened with piano rolls, with records, with radio, and with cable—isn’t that copyright holders squash the new technology. Instead, they cut a deal and start making money from it.

> “History has shown that time and market forces often provide equilibrium in balancing interests,” Wu writes.

It is completely braindead to argue that market forces had anything to do with compulsory licensing. It is a matter determined by courts in the public interest.

anoncow 9 months ago

Sad and criminal.

afh1 9 months ago

Ironically behind a paywall (and below a political ad)

LisaDziuba 9 months ago

[dead]

geniium 9 months ago

TL;DR: bye bye Google

pluc 9 months ago

[flagged]

[-]

9 months ago

[deleted]

datadrivenangel 9 months ago

Thanks Paul!

[-]

pvg 9 months ago

Wrong number, I'm afraid.

[-]

montag 9 months ago

Thanks Peter

[-]

Evidlo 9 months ago

[flagged]

dang 8 months ago

(We detached this subthread from https://news.ycombinator.com/item?id=41917170)

andrewstuart 9 months ago

Google must be tempted to put them in an LLM.

[-]

bborud 9 months ago

It would surprise me greatly if they haven't already.

[-]

johnobrien1010 9 months ago

Another reason that they should never have been allowed to ingest all the books in the first place. Without paying for the rights to use the digital form of the book, a use which is explicitly prohibited by the publisher, they digitized the books anyway. If they used it to train an LLM, and the LLM regurgitates near facsimiles of all the copyrighted works without compensation to the original rights holders, that seems like something that should be illegal.

renewiltord 9 months ago

Good. It’s important that free access not be permitted. We don’t know what personal data might be contained within. We should only allow those works after a human (appropriately certified) has verified that no personal data exists within.

If it exists within the book must be destroyed in its entirety. Too many works of so-called scholarship have relied on the personal letters of dead people.

We should not reward grave robbing. The most important thing is the personal data. We must protect the personal data.

[-]

mparnisari 9 months ago

> We don’t know what personal data might be contained within

You really think that a book author doesn't know that whatever they put on a book will stay there forever?

What a weird take :p

[-]

renewiltord 9 months ago

They could have put other people’s personal data in. The only thing that matters is personal data. We have a right to making sure it’s removed. Once these massive privacy risks are dealt with we can start again. But only after we have a comprehensive report of the environmental consequences of this.

We’ve all had enough of people just stomping over our rights in this move fast and break things / ask forgiveness instead of permission crap.

Do it right. And respectfully instead of abusing people’s data.