I have an ASCII art Easter egg like this in an SEO product I made. :)
https://www.checkbot.io/robots.txt
I should probably add this SEO tip too because the purpose of robots.txt is confusing: If you want to remove/deindex a page from Google search, you counterintuitively need to allow the page to be crawled in the robots.txt file, and then add a noindex response header or noindex meta tag to the page. This way the crawler gets to see the noindex instruction. Robots.txt controls which pages can be crawled, not which pages can be indexed.
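To make that concrete, the pieces fit together like this (a sketch; /old-page/ is a made-up path). robots.txt has to leave the page crawlable:

    User-agent: *
    Allow: /old-page/

and the page itself carries the noindex, either as a meta tag:

    <meta name="robots" content="noindex">

or as an HTTP response header:

    X-Robots-Tag: noindex

Once Google recrawls the page and sees the noindex, it drops it from the index; only after that is it safe to disallow the page in robots.txt.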
What does “OUR TREE IS A REDWOOD” refer to? A quick search doesn’t yield any definite results.
That’s a funny one!
Does anyone know of others like that?
Here is mine: https://FreeSolitaire.win/robots.txt
Google used to have a /killer-robots.txt which forbade the T-1000 and T-800 from accessing Larry Page and Sergey Brin, but they took that down at some point.
This is what happens if your robot isn't nice
That's not from robots.txt, but their Bot Management feature, which blocks things calling themselves Googlebot that don't come from known Google IPs.
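Roughly how that check works, as a minimal sketch in Python of forward-confirmed reverse DNS (the method Google documents for verifying Googlebot; Cloudflare's actual implementation isn't public):

    import socket

    def is_real_googlebot(ip: str) -> bool:
        try:
            # Reverse lookup: the PTR record must sit in a Google-owned domain.
            hostname, _, _ = socket.gethostbyaddr(ip)
            if not hostname.endswith((".googlebot.com", ".google.com")):
                return False
            # Forward-confirm: the hostname must resolve back to the same IP,
            # otherwise anyone could fake the PTR record on their own range.
            return ip in socket.gethostbyname_ex(hostname)[2]
        except (socket.herror, socket.gaierror):
            return False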
Are GCP IPs considered Google IPs?
No.
No, I am very sure they are not.
What's the purpose of "User-Agent: DemandbaseWebsitePreview/0.1"? I couldn't find anything about that agent, but I assume it's somehow related to demandbase.com?
But why are Demandbase and Twitter the only whitelisted entries? Google and Bing being absent is a bit surprising, but I assume they're whitelisted through a different mechanism (like a Google webmaster account)?
It is one of the services they use. Per the cookie policy page [1]:
> DemandBase - Enables us to identify companies who intend to purchase our products and solutions and deliver more relevant messages and offers to our Website visitors.
[1]: https://www.cloudflare.com/en-in/cookie-policy/
My guess is that the Twitter one is for previews when you link to a website on Twitter.
That’s cool, if any scrapers still respect the robots.txt, that is.
They may or may not, though respecting robots.txt is a nice way of not having your IP range end up on blacklists. With Cloudflare in particular, that can be a bit of a pain.
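If you're writing a crawler, honoring robots.txt is cheap anyway; Python's stdlib ships a parser (a minimal sketch, and "MyExampleBot" is a made-up user agent token):

    from urllib import robotparser

    rp = robotparser.RobotFileParser("https://www.cloudflare.com/robots.txt")
    rp.read()  # fetch and parse the file

    # Check each URL against your bot's user agent token before requesting it.
    if rp.can_fetch("MyExampleBot", "https://www.cloudflare.com/some/path"):
        ...  # safe to fetch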
They're pretty nice to deal with if you're upfront about what you are doing and clearly identify your bot, as well as register it with their bot detection. There's a form floating around somewhere for that.
Think of robots.txt as less of a no trespassing sign and more of a, "You can visit but here are the rules to follow if you don't want to get shot" sign.
I was surprised any ever did, honestly
If those robots could read, they'd be very upset.
Has anyone worked on anything like this for AI scrapers?
A robots.txt that asks AI scrapers not to scrape?
There are a couple of services that keep updated lists of known AI scraper user agents. A quick search reveals a handful.
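The end result usually looks like a robots.txt group that names the published crawler tokens and disallows everything (a sketch; GPTBot, CCBot, Google-Extended, and anthropic-ai are tokens those vendors have published, but a real list should be pulled from one of the maintained services):

    User-agent: GPTBot
    User-agent: CCBot
    User-agent: Google-Extended
    User-agent: anthropic-ai
    Disallow: /

Multiple User-agent lines above a single Disallow form one group, so all four crawlers get the same rule.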
Cute how they hashtag out so many lines thinking that the robots will ignore them. AI tools see past such tricks and no doubt have logged Cloudflare's use of anti-machine ASCII art. When humanity is put on trial, the AI jury will see this.