The Nonprofit Doing the AI Industry's Dirty Work

(theatlantic.com)

9 points | by kgwgk 12 hours ago

2 comments

  • Aloisius 9 hours ago

Calling archiving the web for researchers "dirty work" is a bit much.

    Unless something has changed since I was there, the crawler didn't intentionally bypass any paywalls.

The crawler obeyed robots.txt, throttled itself when visiting slow sites to avoid overloading them, and announced its user agent clearly, with a URL explaining what it was and how to block it if desired.
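
    For context, the polite-crawler behavior described above maps to a few standard mechanisms. Below is a minimal Python sketch, not Common Crawl's actual code: the bot name, info URL, and fallback delay are illustrative, and the adaptive throttling for slow sites that the comment describes is simplified to robots.txt's Crawl-delay plus a fixed minimum.

        import time
        import urllib.robotparser
        import urllib.request
        from urllib.parse import urlparse, urlunparse

        # Hypothetical user agent; a real crawler's string links to a
        # page explaining what the bot is and how to block it.
        USER_AGENT = "ExampleArchiveBot/1.0 (+https://example.org/bot-info)"

        def fetch_politely(url, min_delay=1.0):
            """Fetch url only if the site's robots.txt allows it,
            identifying the bot clearly and pausing between requests."""
            parts = urlparse(url)
            robots_url = urlunparse(
                (parts.scheme, parts.netloc, "/robots.txt", "", "", ""))

            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(robots_url)
            rp.read()  # fetch and parse the site's robots.txt

            if not rp.can_fetch(USER_AGENT, url):
                return None  # the site has opted out; respect that

            # Honor a site-declared Crawl-delay, falling back to
            # min_delay. (Throttling adaptively based on observed
            # response times is omitted for brevity.)
            time.sleep(rp.crawl_delay(USER_AGENT) or min_delay)

            req = urllib.request.Request(
                url, headers={"User-Agent": USER_AGENT})
            with urllib.request.urlopen(req, timeout=30) as resp:
                return resp.read()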
