The Tragedy of Google Books (2017)

(theatlantic.com)

239 points | by lispybanana 13 hours ago

99 comments

  • philipkglass 12 hours ago

    These Google scans are also available in the HathiTrust [1], an organization built from the big academic libraries that participated in early book digitization efforts. The HathiTrust is better about letting the public read books that have actually fallen into the public domain. I have found many books that are "snippet view" only on Google Books but freely visible on HathiTrust.

    If you are a student or researcher at one of the participating HathiTrust institutions, you can also get access to scans of books that are still in copyright.

    The one advantage Google Books still has is that its search tools are much faster and sometimes better, so it can be useful to search for phrases or topics on Google Books and then jump over to HathiTrust to read specific books surfaced by the search.

    [1] https://www.hathitrust.org/

    • dredmorbius 6 hours ago

      HathiTrust is a fine example of a repository which is in theory useful but in practice all but useless.

      Participation is limited to tertiary academic institutions, and possibly only four-year (rather than two-year) ones. This excludes local (city/county) libraries, as well as primary/secondary (grammar / middle / high school in the US) libraries.

      Even public-domain records cannot be downloaded in whole, but rather can be saved one page at a time as PDFs. I'm pretty sure that those interested in more useful archival will and/or have created automated tools to do so, but HathiTrust remains the most notable point-of-access for such works, and the additional generation of conversion and republication further degrades the quality of original-publication formats. (It's less a problem for regenerated works from OCR'd or manually-converted documents, but those of course lose all the characteristics of original publication.)

      And of course, many materials still under copyright are not accessible to the general public at all, no matter how obscure. I'd run into a case of this some months back trying to get a date attribution of an Alan Watts lecture which had been posted to HN:

      <https://news.ycombinator.com/item?id=41231047> (thread).

      And my request still stands. Anyone with an academic affiliation who can check <https://catalog.hathitrust.org/Record/000678503> and see how it relates to this post (<https://news.ycombinator.com/item?id=41230841>) would have my gratitude.

      • bgoated01 3 hours ago

        I just put in a request at my university library for that item. I'll let you know what it turns up.

      • Eisenstein 2 hours ago

        You might want to look for:

        Watts, Alan. Myth and Religion : the Edited Transcripts. First edition. Boston: Charles E. Tuttle Co., 1996.

        It contains "Jesus - His Religion, Or the Religion About Him", which appears to be a very slightly different title from the work that you are searching for.

        • dredmorbius 2 hours ago

          I'd found that text at the time as noted here: <https://news.ycombinator.com/item?id=41235652>.

          The text includes the transcripts, but doesn't include the original date(s) of delivery / publication. And it's published a quarter century after the initial records of the lecture.

          As noted, I'd emailed the Alan Watts institute but have received no reply.

    • acidburnNSA 8 hours ago

      HathiTrust has been absolutely transformative for me, as an amateur nuclear enterprise historian.

  • yonran 11 hours ago

    > Dan Clancy, the Google engineering lead on the project who helped design the settlement, thinks that it was a particular brand of objector—not Google’s competitors but “sympathetic entities” you’d think would be in favor of it, like library enthusiasts, academic authors, and so on—that ultimately flipped the DOJ.

    I was at Google in 2009 on a team adjacent to Dan Clancy when he was most excited about the Authors’ Guild negotiations to publish orphan works and create a portal to pay copyright holders who signed up, and I recall that one opponent he was frustrated with was Brewster Kahle of the Internet Archive, who filed a jealous amicus brief (https://docs.justia.com/cases/federal/district-courts/new-yo...) complaining that the Authors’ Guild settlement would not grant him access to publishing orphan works too. In my opinion Kahle was wrong; the existence of one orphan works clearinghouse would have encouraged Congress to grant more libraries access instead of doing nothing, which is what actually happened in the 15 years since then. Instead of one company selling out-of-print but in-copyright books, or multiple organizations, no one is allowed to sell them today.

    Since then, of course, Brewster Kahle launched an e-library of copyrighted books without legal authorization anyway which will probably be the death of the current organization that runs the Internet Archive. Tragic all around.

    • chambers 10 hours ago

      I wish the contradiction you spotted was clear on their Wikipedia page. It demonstrates how far back IA's management troubles go, and how their clean image was maybe just an image.

      For me, I became concerned when they fibbed about why the Internet Archive Credit Union was liquidated. IA alleged it was shut down due to onerous regulations, but the government said IA actually never lived up to their goal of allowing local, low-income folk to sign-up for their service. https://ncua.gov/newsroom/press-release/2016/internet-archiv...

    • mastazi 10 hours ago

      This is an insightful comment and I thank you for sharing it but, after having looked at the brief you linked

      > a jealous amicus brief that the Authors’ Guild settlement would not grant him access to publishing orphan works too

      that's not a fair overview of the amicus brief, there are good points there about the process of notifying orphan works rights holders and about the risk of a monopolistic position. I do agree with you on this part though

      > the existence of one orphan works clearinghouse would have encouraged Congress to grant more libraries access instead of doing nothing

      Edit: I also agree with you that the way the IA subsequently created its e-library was not ideal.

      • yonran 8 hours ago

        > that's not a fair overview of the amicus brief, there are good points there about the process of notifying orphan works rights holders and about the risk of a monopolistic position

        What I meant by “jealous” is that the Internet Archive’s interest was not to improve author notification or to protect foreign authors; it was to provide a competing service under similar or better terms than Google was able to negotiate without spending the time and money that Google did litigating. Kahle wanted what was in Google’s settlement.

        And what I meant by “Kahle was wrong” is not that every argument that his lawyers thought up was false; I think the agreement was later amended to fix some issues. My point is that Kahle’s theory of change was wrong. He thought that when the settlement was rejected, then Google would push Congress to create an orphan works law which the Internet Archive could use to publish old books too. As he wrote in his op-ed, “We need to focus on legislation to address works that are caught in copyright limbo. … We are very close to having universal access to all knowledge. Let's not stumble now.” https://www.washingtonpost.com/wp-dyn/content/article/2009/0... As it turns out, the rejection of the class action settlement did not cause Congress to create an orphan works law. In retrospect, we would have been more likely to get an orphan works law if Google had been allowed to set up a proof of the concept, making the monopoly on orphan works temporary.

        • cxr 5 hours ago

          There's such a weird tone to your posts. It's as if they're meant to give the impression that Kahle had a substantial, if not singlehanded, influence over the outcome. In reality, his input probably didn't have even the impact that Kahle himself hoped for and the appropriate adjective to describe the effect is probably "negligible", if at all. It was a class action lawsuit with extremely dubious underpinnings where over 6,000 people wrote in to ask that they not be considered part of the class.

      • lokar 9 hours ago

        I would say it’s much worse than “not ideal”; they may have poisoned the well for decades to come.

        • adastra22 8 hours ago

          Maybe permanently, as societal stances on these sorts of issues tend to solidify over time. In a couple of generations the very idea of a library may be confined to history thanks to IA :(

    • jamiek88 10 hours ago

      That pandemic library was a huge, obvious overstep by him.

      It will have consequences far beyond the immediate lawsuit too.

      The very concept has basically been iced for a generation and the net is only getting more locked down not less.

      • jmb99 3 hours ago

        Fortunately (by some definition of fortunately), most countries don’t agree on exactly how the web should be “locked down.” This benefits at least some people (like me) who live in countries that also make no effort to restrict what can be shoved down the internet tube, including from countries that don’t particularly care about western copyright law. Would it be nice to have a fully sanctioned pandemic library-style service? Absolutely. But I have never once looked for a textbook, paper, regular book, etc. online and not found a copy for free. It usually takes the same amount of time or less compared to finding a copy on Amazon (if it’s currently in print), and almost always less time than using my library’s clunky online ebook platform[1].

        Is that legal? Technically yes, in my country. Is it ethical? Debatable, depending on who you’re asking. But for me personally, I have found it to be getting substantially easier to find high quality copies of copyrighted anything in the past 3-5 years compared to 10-15 years ago, so I don’t necessarily agree with the blanket statement that “the net is only getting more locked down.”

        [1] I like to use the library as much as possible, if for nothing else than to increase usage numbers and so marginally decrease the likelihood of funding cuts.

    • kmeisthax 2 hours ago

      How would a settlement with the Authors' Guild cover orphan works? If the Authors' Guild is in a position to grant a license, then it's not an orphan work. The whole orphan works problem is that for a lot of valueless works, nobody knows who owns what.

    • shkkmo 7 hours ago

      > In my opinion Kahle was wrong; the existence of one orphan works clearinghouse would have encouraged Congress to grant more libraries access instead of doing nothing

      Maybe. I think that is a pretty optimistic view of congress and our political process. I would argue that having a powerful, rich company with a monopoly to lose would have made passing such a law less likely, not more.

      I do think we would have been better off with a Google monopoly on unpublished unclaimed books than with the lack of access we have today.

      The article says:

      > You’d get in a lot of trouble, they said, but all you’d have to do, more or less, is write a single database query. You’d flip some access control bits from off to on. It might take a few minutes for the command to propagate.

      If it's so easy, I'm surprised nobody has done it and accepted the consequences. It seems one of the largest single positive impacts any person could make on the world. Once it's released, it'll never go back in the box. A modern Pandora.

    • pessimizer 10 hours ago

      Thanks for making me aware of this. This guy's heart is clearly (to me) in the right place, but his understanding of power is seriously lacking. That's probably what gave him the hubris to create Wayback and IA, but he'll be absolutely dumbstruck when they shut it down.

      • kragen 8 hours ago

        He won't be surprised at all. His slogan is "governments burn libraries". He's been able to forestall that for a while, and even provide public access, but permanence of the IA as an institution was never in the cards, given its subversive goal: universal access to all human knowledge.

        Guess where the first backup copy of the Internet Archive is located.

      • the_af 8 hours ago

        The Wayback machine is such an invaluable tool.

        I've used it to track down when wording on a site (for something relevant to my job) changed, for example.

  • caseysoftware 11 hours ago

    I worked at the Library of Congress on their Digital Preservation Project, circa 2001-2003. The stated goal was to "digitize all of the Library's collections" and while most people think of books, I was in the Motion Picture Broadcast and Recorded Sound Division.

    In our collection were Thomas Edison's first motion pictures, wire spool recordings from reporters at D-Day, and LPs of some of the greatest musicians of all time. And that was just our Division. Others - like American Heritage - had photos from the US Civil War and more.

    Anyway, while the Rights information is one big, ugly tangled web, the other side is the hardware to read the formats. Much of the media is fragile and/or dangerous to use so you have to be exceptionally careful. Then you have to document all the settings you used because imagine that three months from now, you learn some filter you used was wrong or the hardware was misconfigured.. you need to go back and understand what was affected how.

    Cool space. I wish I'd worked there longer.

    • caseysoftware 11 hours ago

      Also.. it was fun learning the answer to "what is the work?"

      If you have an LP or wire spool recording, the audio is the key, obvious work. But then you have the album cover, the spool case, and the physical condition of the media. Being able to see an album cover or read a reporter's notes/labeling is almost as important as the audio.

    • ForHackernews 9 hours ago

      Is the Library of Congress really beholden to copyright laws? I guess I assumed as the national deposit library they had a special exemption to copy any damn thing they pleased for archival purposes.

      If they don't have that prerogative, they probably should, and Congress should legislate that to be the case.

      • aspenmayer 5 hours ago

        The Library of Congress and its staff determine fair use exceptions in certain contexts so I’m not sure who could find fault with them, as they could simply authorize it before or after the fact, from what I understand.

  • ErikAugust 12 hours ago

    “Page had always wanted to digitize books. Way back in 1996, the student project that eventually became Google—a “crawler” that would ingest documents and rank them for relevance against a user’s query—was actually conceived as part of an effort “to develop the enabling technologies for a single, integrated and universal digital library.” The idea was that in the future, once all books were digitized, you’d be able to map the citations among them, see which books got cited the most, and use that data to give better search results to library patrons. But books still lived mostly on paper. Page and his research partner, Sergey Brin, developed their popularity-contest-by-citation idea using pages from the World Wide Web.”

    Larry Page had some cool ideas… can’t imagine Books will ever be resurrected, unfortunately.

    • dekhn 11 hours ago

      He really wanted to digitize all of them to provide reference and training data for early language models (well before LLMs, transformers, etc).

      He also had a plan (with George Church) to build enormous warehouses holding large-scale biology research infrastructure right next to google data centers. Because most biology research is done at locations that have reached their limit on computational/storage capacity.

      Larry had many good ideas but he struggled to get the majority of them off the ground. For example, when Trump was president and invited all the major tech leaders, Larry came with a plan to upgrade the US electrical system with long-range DC.

      • shiroiushi 4 hours ago

        >Larry came with a plan to upgrade the US electrical system with long-range DC.

        I feel like some crucial detail is missing here. They already use HVDC for long-distance transmission lines, inside and outside of the US. Texas could benefit from it I suppose, but the US in general already uses it where appropriate AFAIK.

    • lqstuart 2 hours ago

      …and then they sold out to a Wall Street dickhead, and here we are

    • carlosjobim 10 hours ago

      > The idea was that in the future, once all books were digitized, you’d be able to map the citations among them, see which books got cited the most, and use that data to give better search results to library patrons.

      You can do something similar to this already, by mapping which books are cited in Wikipedia articles. If you know how to do such a thing, because I don't.
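A rough sketch of what that mapping could look like, assuming you already have article wikitext in hand (from a dump or the API) and that the books are cited with `{{cite book}}` templates carrying an `isbn` field. The sample articles and ISBNs below are made up for illustration, and `count_book_citations` is a hypothetical helper, not an existing tool:

```python
import re
from collections import Counter

# Made-up sample wikitext; real input would come from a Wikipedia dump or API.
SAMPLE_ARTICLES = {
    "Article A": "{{cite book |title=One |isbn=978-0-00-000000-2}} and "
                 "{{cite book |title=Two |isbn=978-0-00-000001-9}}",
    "Article B": "{{cite book | title=One | isbn = 9780000000002 }}",
    "Article C": "{{cite book |title=One |isbn=978-0-00-000000-2}} again "
                 "{{cite book |title=One |isbn=978-0-00-000000-2}}",
}

# Matches the isbn parameter of a citation template.
ISBN_FIELD = re.compile(r"\|\s*isbn\s*=\s*([0-9Xx\- ]+)", re.IGNORECASE)


def normalize_isbn(raw):
    """Strip hyphens and spaces so the same ISBN compares equal."""
    return re.sub(r"[^0-9Xx]", "", raw).upper()


def count_book_citations(articles):
    """Count how many articles cite each ISBN (once per article)."""
    counts = Counter()
    for text in articles.values():
        isbns = {normalize_isbn(m) for m in ISBN_FIELD.findall(text)}
        isbns.discard("")
        counts.update(isbns)
    return counts


if __name__ == "__main__":
    for isbn, n in count_book_citations(SAMPLE_ARTICLES).most_common():
        print(isbn, n)
```

Counting each book once per article keeps a single article that repeats a citation from inflating the totals. Books cited without an ISBN would be missed by this approach.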

  • Zigurd 12 hours ago

    O'Reilly, for whom I've been a lead author and co-author, did this: https://www.oreilly.com/pub/pr/1042

    They call it Founder's Copyright. They also use Creative Commons. The goal is to make out-of-print books available at no cost.

    • card_zero 12 hours ago

      > A complete list of available titles is at www.oreilly.com/openbook

      Exciting!

      Follows link

      Link no longer exists, gets O'Reilly front page instead

      "Introducing the AI Academy, Help your entire org put GenAI to work"

      Thanks O'Reilly.

      • stvltvs 10 hours ago

        Looks like Openbook stuff is still there, just homeless. I had to do a web search to find it. For example:

        https://www.oreilly.com/openbook/make3/book/

      • ToucanLoucan 10 hours ago

        The original dream of the internet: Information, freely available to any who want it.

        The new dream of the internet: Some information, that aligns with the values of our advertisers, delivered via an LLM that sometimes makes shit up.

        • seanp2k2 5 minutes ago

          Yeah, each year we inch closer to an internet where the only things to do revolve around buying things; watch “content” which mostly revolves around creators shilling products, research products, or buy products. Every hobby has to be monetized now, everything has to be a side hustle, every impression monetized. Few seem to bother anymore with personal blogs that exist for their own enjoyment and sharing of knowledge, and yet with all this paid creation, full-time artists struggle more than ever, largely unable to afford living costs in the very cities they helped to build the culture and value of.

          I find it personally difficult to look at the entirety of the internet in 2024 and say that it’s definitely better for society than it was in 2004. I guess now at least we can mostly book appointments on our phones without having to speak with someone in real-time as they read dates and times off of a calendar interface that we can now just use ourselves directly.

      • MollyRealized 12 hours ago

        It's okay, I'll just check the Wayb--shit

    • microtherion 9 hours ago

      It's somewhat ironic that, while the individual books are still accessible, their index pages https://www.oreilly.com/free and https://www.oreilly.com/openbook both redirect to some AI propaganda these days, with no links to the books left.

      A third party page still has links to some (possibly all) of the books: https://zapier.com/blog/free-oreilly-press-books/

  • boramalper 24 minutes ago

    Of course someone needs to scan/digitise those books but for those which already are, there is Anna’s Archive.

    https://en.wikipedia.org/wiki/Anna%27s_Archive

  • svilen_dobrev 12 hours ago

    This seems to be the fate of knowledge/content that stays in institutions which were built with the idea of collecting it and growing it, but which have turned into walled gardens/crypts of sorts. It rots, rusts, and is forgotten.

    A very cynical and dark view is that the new things/people need that oblivion in order to feel great, so as not to have to compare themselves with the old, greater ones. Rewriting history as the current powers-that-be see fit is easier this way.

    Or maybe it's just collective stupidity? Or societal immaturity?

    (I am coming from a completely different killed project on a different continent, but the idea is the same.)

    • shiroiushi 4 hours ago

      I seriously doubt there's very much highly relevant old knowledge locked away somewhere. Is there interesting stuff we don't have good access to? Sure, but mainly of interest to historians (pro or amateur). You're not likely to find the cure for cancer written down in some 1000-year-old book somewhere. And while a few people might really be interested in reading decades or centuries-old novels that weren't popular enough to be called "classics" now, the vast majority of people aren't going to find such stories about people in the distant past all that interesting.

      Of course, it's best to preserve past knowledge, but I think the idea that this is part of some kind of conspiracy to keep people buying new stuff is pretty silly. People are always going to want new stuff, as society grows and changes.

    • kyleee 11 hours ago

      I think you are on to something, people frequently don’t want to grapple with and understand what has been done before, they prefer to just wing it and move forward on their own.

    • SapporoChris 10 hours ago

      I am fairly certain there is more knowledge/content available to anyone in this century than last century or any century before it. But perhaps I have misread your comment.

  • xipho 12 hours ago

    A huge proportion of this corpus is found in the Hathi Trust (see https://www.hathitrust.org/the-collection/). We have had a grant to crawl and derive an index on it via their supercomputing resources. I'm sure they are looking to LLM proposals, though they are exceedingly careful about the copyright issues.

    https://www.hathitrust.org/

  • submeta 11 hours ago

    With Library Genesis, who needs Google Books anymore? I buy books physically to support the author/s and download an epub version from said site to my Kindle. The physical books I hardly read; they are for my shelf. I love the feeling of printed books, but I read in bed, and it’s easier to hold an ebook. I also read when I commute. It’s lighter to have my Kindle Oasis with me with tons of books on it.

    • kccqzy 7 hours ago

      Someone needs to scan the book and upload it to library genesis. The article said Google had developed this massively efficient apparatus for scanning (or taking photographs of) books, and most of the article was about out-of-print books.

      I personally have actually tried to contribute to libgen a particular difficult-to-find-online book by buying it, scanning it, and uploading it. There need to be more people doing this.

    • ghaff 11 hours ago

      There’s the everything available online for free mindset. But, yes, I’ve basically donated all my books that were in the public domain. And, in general, have been massively purging my book collection of stuff I won’t realistically read again.

      • submeta 10 hours ago

        I do buy books, to support the authors. And I would encourage anyone to support the authors they like to read.

        • ASalazarMX 10 hours ago

          I agree, but also wouldn't lose sleep for pirating a book of an author that died more than 20 years ago, in most contexts.

  • thayne 12 hours ago

    IMO if a work is out of print (or equivalent depending on the medium) for more than a few years, it should be released into the public domain. Or maybe something like the public domain, but requires attribution.

    • kps 12 hours ago

      Like trademark: Use it or lose it.

      (The reality is that publishers would put lazy photocopies up for sale at ten zillion dollars a piece.)

    • eschneider 11 hours ago

      Have you dealt with publishers? If a work is out of print for a few years, much better to have rights revert to the creator.

      • WillAdams 8 hours ago

        Even that doesn't always work --- I was rebuffed by Joan Turville-Petre's son when I asked for a license to reprint his mother's notes on J.R.R. Tolkien's translation of _The Old English Exodus_ on the grounds that he would prefer to work with an academic, rather than an individual.

        Anyone know an academic specializing in Old English who would like to oversee this reprinting? I have a typeset PDF which only wants proofreading and updating of the index.

    • giraffe_lady 12 hours ago

      Then every book will be immediately out of print after its initial run, while the not-quite-a-cartel of publishers all decline to print it until it hits the point where they no longer have to pay the author.

      • Jtsummers 12 hours ago

        Then the publisher loses out on exclusive publishing rights and also loses money. It's in their interests to keep it in print so long as it's a profitable book, even if they have to pay some percentage to the author. Once it goes into public domain every publisher can reprint it and the original publisher has to compete with them on price.

        • jamiek88 10 hours ago

          > so long as it's a profitable book

          And here is the rub. You’ll end up with three or four super authors with the rest being ripped off.

          Much better for it to revert to the author in that situation IMO.

          • Jtsummers 10 hours ago

            I'm not arguing for it (or against it for that matter), I was just pointing out that the analysis in the comment I responded to didn't make sense. Every book won't be allowed to fall out of print and copyright just to exploit the authors because it would also hurt the publishers, they also benefit from exclusive publishing rights. Publishing rights are granted by the copyright holder (the author) to the publisher, much like patent licenses.

            Regarding unprofitable books, they'll fall out of print anyways because they're unprofitable. Those authors won't be getting ripped off because they won't be making money either way beyond initial commissions and what few sales they get.

            > Much better for it to revert to the author in that situation IMO.

            The publisher doesn't hold the copyright, the author does, so copyright (the particular right under discussion) can't revert to the author as it never left the author. What the publisher holds is publishing rights per a contract with the author. That could revert back to the author (or be voided or however it's structured), and that would be reasonable but we don't need any laws for it, that would fall under normal contract terms. Whether it's a common thing now or feasible for a particular author (with no clout? maybe not, with billions in sales from prior books? probably) is another matter.

        • giraffe_lady 12 hours ago

          ok

      • tap-snap-or-nap an hour ago

        Do we really need publishers anymore?

        • bloak an hour ago

          They are useful for quality control when the author is not well known.

    • pfdietz 11 hours ago

      So, e-books are either immediately out of print, or never out of print?

      • thayne an hour ago

        By "in print" I mean, the publisher is actively selling it.

        Although, if I were writing the law I would require selling DRM free ebooks for ebooks to count for maintaining the copyright.

      • tightbookkeeper 11 hours ago

        What if we applied the simple test that the book was originally published on paper and no other printings have occurred (digital or paper).

      • pessimizer 9 hours ago

        Never out of print. If there's an e-copy available to buy, that's better than millions of other books.

  • Animats 12 hours ago

    We need a Copyright Term Reduction Act.

    It's time. 50 years, renewal is possible but expensive.

    • mjevans 10 hours ago

      Just my opinion but as a starting point for the argument...

        * 20 years from date of first publish (renewable up to CAP? 50 years)
        * Must remain available every year
        * 10 year renewal blocks with massive registration fee increases
        * Compulsory maximum license fee cap (can offer for less) in the laws
      
      Note this is not TRADE MARK; trade marks are _consumer protection_ related to 'brand ownership'.

    • ASalazarMX 10 hours ago

      Even 50 is a lot, because it starts at the death of the author. Popular culture shouldn't remain locked out for generations. 50 maximum would be ideal, two generations from the one who experienced it in the original cultural context.

  • pvg 13 hours ago

  • DrNosferatu 3 hours ago

    I've never seen an explicit mention of whether or not the Google Books corpus was used for training LLMs…

    Does anyone know more about it?

  • senkora 12 hours ago

    I’m sure the lawyers will eventually figure out a way to train an LLM on them.

    • datadrivenangel 12 hours ago

      They probably already have! It seems like an amazing training dataset even if you can't share source data.

      • amelius 12 hours ago

        How do you train an LLM such that it is guaranteed to never regurgitate its training data?

        • ASalazarMX 10 hours ago

          You punish it if parts of the answer can be found in its training data, and reward it otherwise.
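As a rough illustration of the "can parts of the answer be found in the training data" check, here is a minimal sketch that measures word n-gram overlap between a model output and a corpus. The function names and the choice of n=5 are assumptions for the example, not how any production system is known to do it, and an overlap score like this can only flag verbatim copying, not guarantee its absence:

```python
def ngrams(tokens, n):
    """Return the set of word n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(output, corpus_docs, n=5):
    """Fraction of the output's n-grams that appear verbatim in the corpus."""
    out = ngrams(output.lower().split(), n)
    if not out:
        return 0.0  # output shorter than n words: nothing to compare
    seen = set()
    for doc in corpus_docs:
        seen |= ngrams(doc.lower().split(), n)
    return len(out & seen) / len(out)


if __name__ == "__main__":
    corpus = ["the quick brown fox jumps over the lazy dog"]
    print(overlap_fraction("the quick brown fox jumps over the lazy dog", corpus))
    print(overlap_fraction("a completely different sentence with no shared phrases", corpus))
```

A score like this could in principle feed a penalty term, but it says nothing about paraphrase or near-copies.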

          • amelius 8 hours ago

            But the whole point of the training is that you reward it if it correctly reproduces the next token.

            • zeroxfe 6 hours ago

              That's not the whole point of the training. It's just (very loosely) a measure of loss used during pre-training. There are many post-training and alignment stages in a typical model that are designed to reward high-quality responses.

              Technically, yes, it's impossible to guarantee that it won't just regurgitate source material (which is mostly around the tails of the data distribution), but the whole point of training is to build generalized intelligence.
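For readers unfamiliar with the loss being discussed, a toy sketch of next-token cross-entropy follows. The probability tables are invented for the example and stand in for a model's predicted distribution at each position; real training computes the same quantity over a full vocabulary with tensors:

```python
import math


def next_token_loss(predicted_probs, target_tokens):
    """Average negative log-probability assigned to the actual next tokens."""
    total = 0.0
    for probs, target in zip(predicted_probs, target_tokens):
        total += -math.log(probs[target])
    return total / len(target_tokens)


if __name__ == "__main__":
    # Invented distributions: one dict of token -> probability per position.
    predicted = [
        {"cat": 0.5, "dog": 0.5},
        {"sat": 0.25, "ran": 0.75},
    ]
    targets = ["cat", "sat"]
    print(next_token_loss(predicted, targets))
```

The loss reaches zero only when the model puts probability 1 on every actual next token, which is why this objective alone rewards reproducing the training text; the later training stages mentioned above optimize for other things.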

              • amelius 3 hours ago

                I guess I used the wrong wording but it doesn't change the argument. Yes, the whole point of training is to build generalized intelligence (or at least that's what we __hope__ for). But as far as I understand, we do it __mainly__ by training for the next word in the sequence.

                PS: you speak of "pre-training" and "post-training", so I'm curious what you think is the main part of the training (?)

  • carlosjobim 10 hours ago

    For Kagi users, I recommend putting books.google.com as a pinned domain. This way, you'll often be presented with some of the best sources for any search query. Then it's a matter of finding the ePub file of that book. To read on macOS, FBReader is a high quality app.

    • emmelaich 7 hours ago

      Thanks. Looks like it's available for Windows/Linux too. At least as of FBReader 2.1.2, 30th September 2024.

  • anoncow 12 hours ago

    Sad and criminal.

  • 2OEH8eoCRo0 12 hours ago

    The tragedy is that Google is tasked with this at all. It would be cool if public libraries could work together on a massive public digital library. This shouldn't be Google's responsibility.

    • Jtsummers 12 hours ago

      Google wasn't tasked (by a third party) with this, they chose to do it.

      • ants_everywhere 11 hours ago

        arguably Google was invented to fund this project.

        The books project predates the search engine and the search engine grew out of the project of creating a universal digital library. The PageRank algorithm is one of a class of algorithms used to score citations in books and papers.
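To make the citation-scoring idea concrete, here is a minimal sketch of PageRank by power iteration over a tiny, made-up citation graph. The damping factor of 0.85 and the book labels are assumptions for the example; this is a textbook-style version, not Google's implementation:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Score nodes of a citation graph; links maps each node to the nodes it cites."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node gets a small baseline share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this node's rank among the works it cites.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A node that cites nothing spreads its rank evenly.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Made-up graph: books A and B both cite C, and C cites D.
    citations = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}
    for book, score in sorted(pagerank(citations).items()):
        print(book, round(score, 4))
```

A work scores highly when it is cited by other works that are themselves well cited, which is the same intuition whether the nodes are books, papers, or web pages.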

    • dredmorbius 6 hours ago

      HathiTrust was ... nearly this.

      Until it too was emasculated.

      <https://en.wikipedia.org/wiki/HathiTrust>

      Otherwise, we have Project Gutenberg (public domain), OpenLibrary (Internet Archive, both PD and copyrighted works), ZLibrary, Library Genesis, and Anna's Archive.

    • NoMoreNicksLeft 11 hours ago

      All humans everywhere have a responsibility to preserve culture and knowledge to the best of their ability. I think what you meant to say is that none of us can trust Google with this important task.

  • andrewstuart 12 hours ago

    Google must be tempted to put them in an LLM.

    • bborud 12 hours ago

      It would surprise me greatly if they haven't already.

      • johnobrien1010 7 hours ago

        Another reason that they should never have been allowed to ingest all the books in the first place. Without paying for the rights to use the digital form of the book, a use which is explicitly prohibited by the publisher, they digitized the books anyway. If they used it to train an LLM, and the LLM regurgitates near facsimiles of all the copyrighted works without compensation to the original rights holders, that seems like something that should be illegal.