I was wrong about robots.txt

KarlHeinzSchwuke@feddit.org · 1 month ago

I was wrong about robots.txt

General_Effort@lemmy.world · 1 month ago

Googlebot if enabled won’t just list you for search, but will also scrape your contents for Google’s AI.

False.

ell1e@leminal.space · edit-2 30 days ago

See here: https://arstechnica.com/tech-policy/2025/07/cloudflare-wants-google-to-change-its-ai-search-crawling-google-likely-wont/ If you have a source that says it’s false, I’d be curious.

General_Effort@lemmy.world · edit-2 29 days ago

Ok. That quotes a tweet by Cloudflare’s CEO. IDK what his qualifications are, but his conflict of interest is obvious enough. Real quality journalism there.

ETA: I looked at what the Cloudflare CEO said again. To be fair to him, he is not actually claiming that Googlebot collects AI training data. He’s talking about the AI overview, which is a search feature. The data for search features is collected by Googlebot. I’m not sure why someone would want their link listed in search but not appear much more prominently in the AI overview.

Here’s Google technical documentation on its crawlers: https://developers.google.com/search/docs/crawling-indexing/google-common-crawlers

ell1e@leminal.space · edit-2 30 days ago

So what’s the quote from the documentation that backs up your claim? The line “perform other product specific crawls” seems extremely vague by design.

General_Effort@lemmy.world · 30 days ago

I’m not really sure what you are asking here. Did you notice that you can scroll down and see a list of their crawlers?

ell1e@leminal.space · edit-2 30 days ago

Nothing on this page seems to contradict the article. But if I simply missed the part that does, I’d be happy to learn.

General_Effort@lemmy.world · 30 days ago

You look up what Googlebot does. No AI.

You want to know what crawlers do AI? Just search for “AI”, or “training”, or some such, or skim through. It’s not long. Google-Extended collects training data. Note that Google-Extended is explicitly not used to rank pages.

Did that help?

ell1e@leminal.space · edit-2 29 days ago

You look up what Googlebot does. No AI.

The page seems written to perhaps suggest it but doesn’t explicitly say the other bots can’t feed into some other sort of AI training. It would be in Google’s interest to mislead the users here.

Edit: I found a quote where it says Googlebot does both in one: “Google-Extended doesn’t have a separate HTTP request user agent string. Crawling is done with existing Google user agent […]” and I guess Cloudflare doesn’t trust Google to abide by the access controls. That seems sensible to me. Edit 2: What exactly the CEO believes was perhaps rightfully disputed below, it was just my guess.

General_Effort@lemmy.world · 29 days ago

It would be a lot to write, if you had to say what something does not do rather than what it does.

I looked at what the Cloudflare CEO said again. To be fair to him, he is not actually backing you up. He’s saying that Google makes no difference between the AI overview and the other search results. That is true. The AI overview is a search feature. I’m not sure why someone would want their link listed in search but not appear much more prominently in the AI overview.

cecilkorik@lemmy.ca · 1 month ago

Absolutely true. They’ll buy the data they want from some shitty crawler running from some data broker in some far-flung and lawless part of the world, hallucinate the actual source, and pretend they had no idea their “data partner” wasn’t respecting robots.txt if they have to, which they won’t ever have to do because it’s literally impossible to detect and prove and realistically unenforceable.

This is a company that removed it’s company motto of “Don’t be evil” because it found it too “limiting”. Don’t be naive.

General_Effort@lemmy.world · 30 days ago

That’s very different from what I called false.

What you describe may happen, but probably not as much as you think. Much of that stuff is just not that valuable. Some personal, colloquial writing is necessary, but Google already pays Reddit. Other stuff is better obtained from torrents or shadow libraries like Anna’s Archive.