Some thoughts on how useful Anubis really is. Combined with comments I read elsewhere about scrapers starting to solve the challenges, I’m afraid Anubis will be outdated soon and we’ll need something else.

  • mfed1122@discuss.tchncs.de · English · 5 up, 11 down · edited · 7 hours ago

    Yeah, well-written stuff. I think Anubis will come and go. This beautifully demonstrates and, best of all, quantifies how negligible Anubis’s effect is.

    It’s very interesting to try to think of what would work, even conceptually. Some sort of purely client-side captcha type of thing perhaps. I keep thinking about it in half-assed ways for minutes at a time.

    Maybe something that scrambles the characters of the site according to some random “offset”, e.g. by randomly selecting a modulus size and an offset to cycle them, or even just a good ol’ cipher. The “captcha” would then be a slider that adjusts the offset. You as the viewer know it’s solved when the text becomes something sensical, so there’s no need for the client code to store a readable key that could be used to auto-undo the scrambling. If the scrapers got smart enough to check for legibility, you could maybe even have some randomly chosen slider values produce English text (not sure how to hide which slider positions would be these red herring ones though), which could maybe be enough to trick the scraper into picking up junk text sometimes.

    • Jade@programming.dev · English · 8 up · 3 hours ago

      That kind of captcha is trivial to bypass via frequency analysis. Text that looks like language, as opposed to random noise, is very statistically recognisable.
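
      A minimal sketch of that kind of bypass, assuming the scrambling is a plain Caesar-style letter shift (the simplest version of the “offset” idea above). The function names and the frequency table (approximate textbook values) are illustrative assumptions, not anyone's actual implementation:

```python
# Approximate English letter frequencies in percent (assumed textbook values).
ENGLISH_FREQ = {
    "a": 8.17, "b": 1.49, "c": 2.78, "d": 4.25, "e": 12.70, "f": 2.23,
    "g": 2.02, "h": 6.09, "i": 6.97, "j": 0.15, "k": 0.77, "l": 4.03,
    "m": 2.41, "n": 6.75, "o": 7.51, "p": 1.93, "q": 0.10, "r": 5.99,
    "s": 6.33, "t": 9.06, "u": 2.76, "v": 0.98, "w": 2.36, "x": 0.15,
    "y": 1.97, "z": 0.07,
}


def shift_text(text: str, offset: int) -> str:
    """Shift ASCII letters by `offset` positions; leave everything else alone."""
    out = []
    for ch in text:
        if "a" <= ch <= "z":
            out.append(chr((ord(ch) - ord("a") + offset) % 26 + ord("a")))
        elif "A" <= ch <= "Z":
            out.append(chr((ord(ch) - ord("A") + offset) % 26 + ord("A")))
        else:
            out.append(ch)
    return "".join(out)


def english_score(text: str) -> float:
    """Higher when the text is dominated by letters that are common in English."""
    return sum(ENGLISH_FREQ.get(ch, 0.0) for ch in text.lower())


def crack(scrambled: str) -> tuple[int, str]:
    """Try all 26 slider positions and keep the one that looks most like English."""
    best = max(range(26), key=lambda k: english_score(shift_text(scrambled, k)))
    return best, shift_text(scrambled, best)
```

      A scraper would call `crack()` on the scrambled page text instead of dragging the slider. Very short texts can fool a scorer this crude, but a full page gives it plenty of letters to work with.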

    • drkt@scribe.disroot.org · English · 11 up · 6 hours ago

      That type of captcha already exists. I don’t know about their specific implementation, but 4chan has it, and it is trivially bypassed by userscripts.

    • Guillaume Rossolini@infosec.exchange · 3 up · 6 hours ago

      @mfed1122 @tofu any client-side tech to avoid (some of the) bots is bound, as its popularity grows, to be either circumvented by the bot’s developers or solved by the model behind the bot once it has picked up enough

      I don’t see how any of these are going to do better than a short-term patch

      • rtxn@lemmy.world · English · 5 up · edited · 4 hours ago

        That’s the great thing about Anubis: it’s not client-side. Not entirely anyways. Similar to public key encryption schemes, it exploits the computational complexity of certain functions to build a challenge that is expensive to solve but cheap to check. The client can’t just say “solved, let me through”, because it has to calculate a number, based on the parameters of the challenge, that fits certain mathematical criteria, and then present it to the server. That’s the “proof of work” component.
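
        A minimal sketch of what such a challenge/response could look like, for illustration only. This thread doesn't spell out Anubis's actual challenge format, so the SHA-256 “leading zero bits” rule and the names `make_challenge`, `solve`, and `verify` are assumptions, not Anubis's API:

```python
import hashlib
import secrets


def make_challenge() -> str:
    """Server side: hand out a random, single-use challenge string."""
    return secrets.token_hex(16)


def leading_zero_bits(digest: bytes) -> int:
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash is enough to check a claimed solution."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute force, about 2**difficulty hashes on average."""
    nonce = 0
    while not verify(challenge, nonce, difficulty):
        nonce += 1
    return nonce
```

        The client has to run `solve()`, while the server only runs `verify()` once per answer. A made-up nonce fails the check, so there is nothing to spoof.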

        A challenge could be something like “find the two prime factors of the semiprime 1522605027922533360535618378132637429718068114961380688657908494580122963258952897654000350692006139”. This number is known as RSA-100. It was first factored in 1991, which took several days of CPU time, but checking the result is trivial since it’s just integer multiplication. A similar semiprime of 260 decimal digits (RSA-260) still hasn’t been factored to this day. You can’t get around mathematics, no matter how advanced your AI model is.
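
        To make the asymmetry concrete, here is a small sketch with a toy semiprime (10007 × 10009) standing in for RSA-100. The function names and the naive trial-division solver are illustrative assumptions; real factoring records use far better algorithms, but the gap between solving and checking remains:

```python
def verify_factors(n: int, p: int, q: int) -> bool:
    """Server side: one multiplication (does not check that p and q are prime)."""
    return 1 < p < n and 1 < q < n and p * q == n


def factor_by_trial_division(n: int) -> tuple[int, int]:
    """Client side: search for the smallest factor; work grows with sqrt(n)."""
    if n % 2 == 0 and n > 2:
        return 2, n // 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return d, n // d
        d += 2
    raise ValueError("n has no nontrivial factors")


# Toy challenge: small enough to solve instantly, unlike a 100-digit semiprime.
challenge = 10007 * 10009
p, q = factor_by_trial_division(challenge)
accepted = verify_factors(challenge, p, q)
```

        The search loop runs thousands of times even for this tiny number, while the check is one multiplication no matter how large the challenge gets.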

        • Guillaume Rossolini@infosec.exchange · 1 up, 5 down · 4 hours ago

          @rtxn I don’t understand how that isn’t client side?

          Anything that is client side can be, if not spoofed, then at least delegated to a sub process, and my argument stands

          • Passerby6497@lemmy.world · English · 4 up · 3 hours ago

            Please, explain to us how you expect to spoof a math problem whose answer you have to provide to the server before proceeding.

            You can math all you want on the client, but the server isn’t going to give you shit until you provide the right answer.

              • Passerby6497@lemmy.world · English · 2 up · 2 hours ago

                You’re given the challenge to solve by the server, yes. But just because the challenge is provided to you, that doesn’t mean you can fake your way through it.

                You still have to calculate the answer before you can get any farther. You can’t bullshit/spoof your way through the math problem to bypass it, because your correct answer is required to proceed.

                There is no way around this, is there?

                Unless the server gives you a well-known problem that you already have the answer to or that is easily calculated, or you find a vulnerability in something like Anubis that makes it accept a wrong answer, not really. You’re stuck at the interstitial page with a math prompt until you solve it.

                Unless I’m misunderstanding your position, I’m not sure what the disconnect is. The original question was about spoofing the challenge client side, but you can’t really spoof the answer to a complicated math problem unless there’s an issue with the server side validation.

                  • zalgotext@sh.itjust.works · English · 1 up · 1 hour ago

                    LLMs can’t just run Chromium unless they’re tool-aware and have an agent running alongside them to facilitate tool use. I highly suspect that AI web crawlers aren’t that sophisticated.

                  • Badabinski@kbin.earth · 3 up · 2 hours ago

                    Anubis has worked if that’s happening. The point is to make it computationally expensive to access a webpage, because that’s a natural rate limiter. It kinda sounds like it needs to be made more computationally expensive, however.

          • rtxn@lemmy.world · English · 4 up · edited · 4 hours ago

            It’s not client-side because validation happens on the server side. The content won’t be displayed until and unless the server receives a valid response, and the challenge is formulated in such a way that calculating a valid answer will always take a long time. It can’t be spoofed because the server will know that the answer is bullshit. In my example, the server will know that the prime factors returned by the client are wrong because their product won’t be equal to the original semiprime. Delegating to a sub-process won’t work either, because what’s the parent process supposed to do? Move on to another piece of content that is also protected by Anubis?

            The point is to waste the client’s time and thus reduce the number of requests the server has to handle, not to prevent scraping altogether.

            • Guillaume Rossolini@infosec.exchange · 1 up · 2 hours ago

              @rtxn validation of what?

              This is a typical network thing: client asks for resource, server says here’s a challenge, client responds or doesn’t, has the correct response or not, but has the challenge regardless

              • rtxn@lemmy.world · English · 1 up · 1 hour ago

                THEN (and this is the part you don’t seem to understand) the client process has to either waste time solving the challenge or cancel the request. Issuing the challenge is, by the way, orders of magnitude lighter on the server than serving the actual meaningful content. If a new request is sent during that time, it will still have to waste time solving the challenge. The scraper will get through eventually, but the challenge delays the response and reduces the load on the server, because while the scrapers are busy computing, it doesn’t have to serve meaningful content to them.
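
                A back-of-the-envelope sketch of that rate-limiting effect. It assumes every page costs the scraper one freshly solved challenge and that challenges are solved in parallel, one per core. The function names and all numbers are hypothetical, not measurements of Anubis:

```python
def max_pages_per_second(cores: int, seconds_per_challenge: float) -> float:
    """Upper bound on fetch rate when each page needs one solved challenge."""
    if cores <= 0 or seconds_per_challenge <= 0:
        raise ValueError("cores and seconds_per_challenge must be positive")
    return cores / seconds_per_challenge


def hours_to_fetch(pages: int, cores: int, seconds_per_challenge: float) -> float:
    """Wall-clock hours needed to fetch `pages` pages at that maximum rate."""
    return pages / max_pages_per_second(cores, seconds_per_challenge) / 3600


# Hypothetical scraper: 8 cores, 2 seconds of CPU time per challenge.
rate = max_pages_per_second(8, 2.0)
hours = hours_to_fetch(1_000_000, 8, 2.0)
```

                With these made-up numbers the scraper tops out at 4 pages per second, so a million pages takes roughly 69 hours of compute that the server never has to spend.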