• General_Effort@lemmy.world
    15 hours ago

    What did he think a crawler is? Why was he surprised that not allowing companies to use his data led to them not using his data? Looks like he has another surprise coming when he notices that search engines no longer index his blog.

    • Archr@lemmy.world
      12 hours ago

      I feel like most casual users would not connect “crawlers” to the link previews that they talk about in the article.

      Sure, if you understand that robots.txt covers all robots, then it follows. But that is not how general news media has been talking about robots.txt.
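
To make the "all robots" point concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content and the example URL are assumptions for illustration, not the blog author's actual file. It only shows what a well-behaved client would conclude from a blanket rule; nothing forces a bot to check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with a blanket rule (an assumption, not the real file).
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant client with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

URL = "https://example.com/blog/post"  # placeholder URL

# A search crawler, a link-preview fetcher, and an AI crawler all fall under "*".
for agent in ("Googlebot", "LinkedInBot", "GPTBot"):
    print(agent, allowed(ROBOTS_TXT, agent, URL))
```

The wildcard does not distinguish between search indexing, link previews, and AI scraping, which is why the previews disappear along with everything else.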

      • General_Effort@lemmy.world
        11 hours ago

        > that is not how general news media has been talking about robots.txt.

        Ahh, yes. I think there is a lesson there.

  • thedruid@lemmy.world
    15 hours ago

    So, if I can add something here for everyone’s benefit:

    No search engine really obeys robots.txt

    Their publicly acknowledged crawlers do, but they have other crawlers that aren’t publicly known and that ignore the file.

    Google knows every inch of your site, allowed or not.

    See, just because a search engine says it doesn’t know, doesn’t mean it hasn’t crawled. It just doesn’t display the results, based on your settings.

    • ell1e@leminal.space
      12 hours ago

      Often it is respected, but the resulting problem is that platforms bundle their regular crawling with questionable AI scraping to blackmail websites into participating in feeding AI.

      For example, Googlebot if enabled won’t just list you for search, but will also scrape your contents for Google’s AI. Edit: see https://arstechnica.com/tech-policy/2025/07/cloudflare-wants-google-to-change-its-ai-search-crawling-google-likely-wont/ as a source. I imagine LinkedInBot, given it’s Microsoft, will feed some other AI of theirs as well on top of the previews.

      Until regulation steps in to require AI bots to separately ask for crawling permission, or to actually get a proper license for reuse of the contents, this situation isn’t going to improve.
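
A rough sketch of why this bundling matters, again with Python's `urllib.robotparser`. The robots.txt below is hypothetical: it blocks one AI-specific user agent (`GPTBot`, used here only as an example token) and allows everyone else. Rules are matched per user-agent token, so a single crawler that serves two purposes cannot be allowed for one and blocked for the other.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one AI crawler by name, allow everything else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant client with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

URL = "https://example.com/blog/post"  # placeholder URL

for agent in ("GPTBot", "Googlebot", "LinkedInBot"):
    print(agent, allowed(ROBOTS_TXT, agent, URL))
```

The file can only say yes or no to a user-agent token. It has no way to say "index me, but don't use this for AI" to a crawler that does both under one name, and compliance stays voluntary either way.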

      • General_Effort@lemmy.world
        15 hours ago

        > Googlebot if enabled won’t just list you for search, but will also scrape your contents for Google’s AI.

        False.

        • cecilkorik@lemmy.ca
          14 hours ago

          Absolutely true. They’ll buy the data they want from some shitty crawler running from some data broker in some far-flung and lawless part of the world, hallucinate the actual source, and pretend they had no idea their “data partner” wasn’t respecting robots.txt if they have to, which they won’t ever have to do because it’s literally impossible to detect and prove and realistically unenforceable.

          This is a company that removed its company motto of “Don’t be evil” because it found it too “limiting”. Don’t be naive.

          • General_Effort@lemmy.world
            13 hours ago

            That’s very different from what I called false.

            What you describe may happen, but probably not as much as you think. Much of that stuff is just not that valuable. Some personal, colloquial writing is necessary, but Google already pays Reddit. Other stuff is better obtained from torrents or shadow libraries like Anna’s Archive.