partial answer: the major labs (Anthropic, OpenAI) do respect robots.txt for their named crawlers, so blocking ClaudeBot/GPTBot in robots.txt works for those specific bots. What you can't easily opt out of is the indirect ingestion via Common Crawl, scraped datasets, and unnamed crawlers. agents.txt doesn't change that picture.
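For the named bots this is plain robots.txt matching, so you can sanity-check your rules with Python's stdlib parser before deploying. The user-agent tokens below are the crawler names the labs publish; the rules and URL are illustrative:

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: block the named AI crawlers, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/page"))     # False: disallowed
print(rp.can_fetch("Googlebot", "https://example.com/page"))  # True: falls to *
```

This only models the well-behaved case; as noted above, it does nothing about Common Crawl or unnamed crawlers.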
The Allow-Training vs Allow-RAG split in the default is the useful part of the file. They're different operations with different costs to the site owner. Training is a one-time bulk ingest. RAG is a runtime fetch per query. A site owner might reasonably allow one and not the other.
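A sketch of what that split might look like in an agents.txt, assuming directive names as described above (the exact syntax is whatever the proposal defines; these lines are illustrative only):

```
User-Agent: *
Allow-Training: no
Allow-RAG: yes
```

Here the one-time bulk ingest for training is refused while per-query runtime fetches are permitted, which is exactly the distinction robots.txt has no vocabulary for.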
I can report that Facebook does not respect robots.txt. Heck, I even mailed domain@fb.com with the specific IP ranges and log samples three times over a month, and they did not even respond. It keeps wasting my CPU cycles to this day by crawling massive development forks (I hope they choke on the data...): about three hits per second, for months now.
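When a crawler ignores robots.txt like this, the fallback is to refuse the traffic outright. A minimal nginx sketch, assuming you've identified the ranges from your logs (203.0.113.0/24 is the RFC 5737 documentation range, a placeholder, not Facebook's actual range):

```
# Return 403 to the offending ranges; everyone else is unaffected.
location / {
    deny 203.0.113.0/24;
    allow all;
}
```

Blocking at the firewall instead of the web server saves even more CPU, since the connection never reaches nginx at all.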
Add HTTP Basic Auth in front of your website, then share the credentials with people who are allowed to view your website. Make sure you don't hand out the credentials to employees of OpenAI, Anthropic, xAI or Microsoft.
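In nginx this is a couple of lines, assuming you have `htpasswd` (from apache2-utils) available; the file path and username are illustrative:

```
# Create the credentials file once:
#   htpasswd -c /etc/nginx/.htpasswd friend
server {
    listen 80;
    auth_basic           "Private site";
    auth_basic_user_file /etc/nginx/.htpasswd;
}
```

Unlike robots.txt, this doesn't ask crawlers to cooperate; anything without credentials gets a 401 before it can fetch a single page.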
Part of the Managed VPS Hosting package, I guess
Do any agents respect agents.txt?
Is there a way to opt my websites out of ai data collection?
Well Claude still thinks it shouldn't read AGENTS.md [1] so they probably also don't care about agents.txt on a web server...
[1] https://github.com/anthropics/claude-code/issues/6235
Any measure you put in place can/will be ignored by the actors who never planned to respect your wishes in the first place.
That's just how the web works, though.
This is true for measures that require the actor to respect your wishes, but doesn't apply to measures that force them to.