• ell1e@leminal.space
    13 hours ago

    Often it is respected, but the resulting problem is that platforms bundle their regular crawlers with questionable AI scraping crawlers, which blackmails websites into feeding AI.

    For example, Googlebot if enabled won’t just list you for search, but will also scrape your contents for Google’s AI. Edit: see https://arstechnica.com/tech-policy/2025/07/cloudflare-wants-google-to-change-its-ai-search-crawling-google-likely-wont/ as a source. I imagine LinkedinBot, given it’s Microsoft, will feed some other AI of theirs as well, on top of the previews.

    Until regulation steps in to require AI bots to separately ask for crawling permission, or to actually get a proper license for reuse of the contents, this situation isn’t going to improve.

    • General_Effort@lemmy.world
      16 hours ago

      > Googlebot if enabled won’t just list you for search, but will also scrape your contents for Google’s AI.

      False.

          • ell1e@leminal.space
            12 hours ago

            So what’s the quote from the documentation that backs up your claim? The line “perform other product specific crawls” seems extremely vague by design.

            • General_Effort@lemmy.world
              12 hours ago

              I’m not really sure what you are asking here. Did you notice that you can scroll down and see a list of their crawlers?

              • ell1e@leminal.space
                10 hours ago

                Nothing on this page seems to contradict the article. But if I simply missed the part that does, I’d be happy to learn.

                • General_Effort@lemmy.world
                  9 hours ago

                  You look up what Googlebot does. No AI.

                  You want to know which crawlers do AI? Just search the page for “AI” or “training” or some such, or skim through. It’s not long. Google-Extended collects training data. Note that Google-Extended is explicitly not used to rank pages.

                  Did that help?
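For anyone who wants to see what the separate tokens look like in practice, here is a minimal sketch. It assumes a hypothetical robots.txt that allows Googlebot and disallows Google-Extended, and it evaluates the rules with Python's standard-library `urllib.robotparser`. That parser is not Google's own matcher, so this only illustrates the intent of the rules; the `example.com` URL and the `allowed` helper are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: stay in search, opt out of the
# Google-Extended token (used as a control token in robots.txt).
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /
"""


def allowed(token: str, url: str = "https://example.com/post") -> bool:
    """Return True if the rules above permit `token` to fetch `url`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(token, url)


if __name__ == "__main__":
    for token in ("Googlebot", "Google-Extended"):
        print(token, allowed(token))
```

Whether a crawler actually honors these rules is a separate question; the file only expresses the site owner's request.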

                  • ell1e@leminal.space
                    2 hours ago

                    > You look up what Googlebot does. No AI.

                    The page seems written to suggest that, but it doesn’t explicitly say that the other bots can’t feed some other sort of AI training. It would be in Google’s interest to mislead users here.

                    Edit: I found a quote where it says Googlebot does both in one: “Google-Extended doesn’t have a separate HTTP request user agent string. Crawling is done with existing Google user agent […]” and I guess Cloudflare doesn’t trust Google to abide by the access controls. That seems sensible to me.
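To illustrate what that quote implies on the server side, here is a rough sketch. It assumes a commonly seen Googlebot User-Agent header value (real strings vary by device and version) and uses a hypothetical helper, `tokens_visible_in_header`, to check which crawler tokens appear in it. If Google-Extended never shows up as its own header, a server that filters on User-Agent cannot tell the two purposes apart and has to rely on robots.txt being honored.

```python
# Assumed example header value; actual Googlebot strings vary.
GOOGLEBOT_UA = (
    "Mozilla/5.0 (compatible; Googlebot/2.1; "
    "+http://www.google.com/bot.html)"
)


def tokens_visible_in_header(user_agent: str, tokens: list[str]) -> list[str]:
    """Return the tokens that appear (case-insensitively) in the header."""
    ua = user_agent.lower()
    return [token for token in tokens if token.lower() in ua]


if __name__ == "__main__":
    print(tokens_visible_in_header(GOOGLEBOT_UA, ["Googlebot", "Google-Extended"]))
```

This is only a substring check on one sample header, not a description of how any particular firewall or CDN identifies crawlers.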

      • cecilkorik@lemmy.ca
        15 hours ago

        Absolutely true. They’ll buy the data they want from some shitty crawler run by some data broker in some far-flung and lawless part of the world, hallucinate the actual source, and, if they have to, pretend they had no idea their “data partner” wasn’t respecting robots.txt. They won’t ever have to, because it’s literally impossible to detect and prove, and realistically unenforceable.

        This is a company that removed its company motto of “Don’t be evil” because it found it too “limiting”. Don’t be naive.

        • General_Effort@lemmy.world
          14 hours ago

          That’s very different from what I called false.

          What you describe may happen, but probably not as much as you think. Much of that stuff is just not that valuable. Some personal, colloquial writing is necessary, but Google already pays Reddit. Other stuff is better obtained from torrents or shadow libraries like Anna’s Archive.