Recently, I wrote an article about my journey learning about robots.txt and its implications for the data rights of what I write on my blog. I was confident that I wanted to ban all crawlers from my website. It turned out there was an unintended consequence that I had not accounted for.
My LinkedIn posts became broken

Ever since I changed my robots.txt file, my LinkedIn posts no longer showed a preview of the article. I was not sure what the issue was at first, since it had worked just fine before. On top of that, I noticed that LinkedIn's algorithm had started serving my posts to fewer and fewer connections. I was a bit confused, thinking it might be a temporary problem, but over the next two weeks the missing previews did not come back.
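For context, here is a sketch of what likely happened. A blanket robots.txt ban also blocks the fetcher LinkedIn uses to build link previews (LinkedIn documents it as LinkedInBot, and it respects robots.txt). Assuming that user agent, one way to keep previews working while still banning everything else would be:

```
# Block all crawlers by default
User-agent: *
Disallow: /

# But allow LinkedIn's link-preview fetcher
# (LinkedInBot is the user agent LinkedIn documents;
# an empty Disallow means "nothing is disallowed")
User-agent: LinkedInBot
Disallow:
```

The trade-off, of course, is that you are explicitly whitelisting a bot you might otherwise prefer to block.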
You look up what Googlebot does: no AI.
Want to know which crawlers do AI? Just search the page for "AI" or "training", or skim through it; it's not long. Google-Extended collects training data. Note that Google-Extended is explicitly not used to rank pages.
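If you want to opt out of AI training while staying in search, Google's documentation says the robots.txt token to target is Google-Extended. A minimal sketch, assuming you only want to block training data collection:

```
# Opt out of Gemini/AI training data collection.
# Google-Extended is a robots.txt control token, not a
# separate crawler; blocking it does not affect search indexing.
User-agent: Google-Extended
Disallow: /

# Regular search crawling stays allowed
User-agent: Googlebot
Disallow:
```

Whether you trust that the token is honored as documented is, as discussed below, a separate question.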
Did that help?
The page seems written to suggest it, but it doesn't explicitly say that the other bots can't feed into some other sort of AI training. It would be in Google's interest to mislead users here.
Edit: I found a quote saying Googlebot does both in one: "Google-Extended doesn't have a separate HTTP request user agent string. Crawling is done with existing Google user agent […]" and I guess Cloudflare doesn't trust Google to abide by the access controls. That seems sensible to me.