Why are anime catgirls blocking my access to the Linux kernel?

tofu@lemmy.nocturnal.garden · 6 months ago

mfed1122@discuss.tchncs.de · edited 5 months ago

Yeah, well-written stuff. I think Anubis will come and go. This beautifully demonstrates and, best of all, quantifies the ~~negligence~~ negligible cost to scrapers of Anubis.

It’s very interesting to try to think of what would work, even conceptually. Some sort of purely client-side captcha type of thing perhaps. I keep thinking about it in half-assed ways for minutes at a time.

Maybe something that scrambles the characters of the site according to some random “offset” of some sort, e.g. maybe randomly selecting a modulus size and an offset to cycle them, or even just a good ol’ cipher. And the “captcha” consists of a slider that adjusts the offset. You as the viewer know it’s solved when the text becomes something sensical - so there’s no need for the client code to store a readable key that could be used to auto-undo the scrambling. You could maybe even have some values of the slider randomly chosen to produce English text if the scrapers got smart enough to check for legibility (not sure how to hide which slider positions would be these red herring ones though) - which could maybe be enough to trick the scraper into picking up junk text sometimes.

Jade@programming.dev · 6 months ago

That kind of captcha is trivial to bypass via frequency analysis. Text that looks like language, as opposed to random noise, is very statistically recognisable.
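
To make the frequency-analysis point concrete, here is a minimal sketch, assuming the “scramble” is a simple Caesar-style rotation of letters (one possible reading of the slider idea above). The function names are hypothetical and the letter-frequency table holds rough textbook values for English; neither comes from any real captcha.

```python
import string

# Rough relative frequencies (percent) of letters in English text.
# Approximate textbook values, used here only to score candidates.
ENGLISH_FREQ = {
    "a": 8.17, "b": 1.49, "c": 2.78, "d": 4.25, "e": 12.70, "f": 2.23,
    "g": 2.02, "h": 6.09, "i": 6.97, "j": 0.15, "k": 0.77, "l": 4.03,
    "m": 2.41, "n": 6.75, "o": 7.51, "p": 1.93, "q": 0.10, "r": 5.99,
    "s": 6.33, "t": 9.06, "u": 2.76, "v": 0.98, "w": 2.36, "x": 0.15,
    "y": 1.97, "z": 0.07,
}


def shift_text(text: str, offset: int) -> str:
    """Rotate ASCII letters by `offset` positions; leave other characters alone."""
    out = []
    for ch in text:
        if ch in string.ascii_lowercase:
            out.append(chr((ord(ch) - ord("a") + offset) % 26 + ord("a")))
        elif ch in string.ascii_uppercase:
            out.append(chr((ord(ch) - ord("A") + offset) % 26 + ord("A")))
        else:
            out.append(ch)
    return "".join(out)


def english_score(text: str) -> float:
    """Average English frequency of the letters in `text` (higher = more English-like)."""
    letters = [c for c in text.lower() if c in ENGLISH_FREQ]
    if not letters:
        return 0.0
    return sum(ENGLISH_FREQ[c] for c in letters) / len(letters)


def guess_offset(scrambled: str) -> int:
    """Try all 26 'slider positions' and keep the one that looks most like English."""
    return max(range(26), key=lambda k: english_score(shift_text(scrambled, -k)))
```

A scraper would call `guess_offset` on the scrambled page and undo the rotation with `shift_text(scrambled, -offset)`, with no human moving a slider. More elaborate scrambles need more elaborate scoring, but the principle is the same.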

Possibly linux@lemmy.zip · 5 months ago

Not to mention it relies on security through obscurity

It wouldn’t be that hard to figure out and bypass

drkt@scribe.disroot.org · 6 months ago

That type of captcha already exists. I don’t know about their specific implementation, but 4chan has it, and it is trivially bypassed by userscripts.

Possibly linux@lemmy.zip · 5 months ago

Anubis is more of an economic solution. It doesn’t stop bots but it does make companies pay more to access content instead of having server operators foot the bill.

dabe@lemmy.zip · 6 months ago

I’m sure you meant to sound more analytical than anything… but this really comes off as arrogant.

You make the claim that Anubis is negligent and will come and go, and then admit to only spending minutes at a time thinking of solutions yourself, which you then just sorta spout. It’s fun to think about solutions to this problem collectively, but can you honestly believe that Anubis is negligent when it’s so clearly working and when the author has been so extremely clear about their own perception of its pitfalls and hasty development? (Go read their blog, it’s a fun time.)

mfed1122@discuss.tchncs.de · edited 5 months ago

By negligence, I meant that the cost is negligible to the companies running scrapers, not that the solution itself is negligent. I should have said “negligibility” of Anubis, sorry - that was poor clarity on my part.

But I do think that the cost of it is indeed negligible, as the article shows. It doesn’t really matter if the author is biased or not, their analysis of the costs seems reasonable. I would need a counter-argument against that to think they were wrong. Just because they’re biased isn’t enough to discount the quantification they attempted to bring to the debate.

Also, I don’t think there’s any hypocrisy in me saying I’ve only thought about other solutions here and there - I’m not maintaining an anti-scraping library. And there have already been indications that scrapers are just accepting the cost of Anubis on Codeberg, right? So I’m not trying to say I’m some sort of tech genius who has the right idea here, but from what Codeberg was saying, and from the numbers in this article, it sure looks like Anubis isn’t the right idea. I am indeed only having fun with my suggestions, not making whole libraries out of them and pronouncing them to be solutions. I personally haven’t seen evidence that Anubis is so clearly working? As the author points out, it seems like it’s only working right now because of how new it is, but if scrapers want to go through it, they easily can - which puts us in a sort of virus/antibiotic eternal war of attrition. And of course that is the case with many things in computing as well. So I guess my open wonderings are just about whether there’s ever any way to develop a countermeasure that the scrapers won’t find “worth it” to force through?

Edit for tone clarity: I don’t want to be antagonistic, rude, or hurtful in any way. Just trying to have a discussion and understand this situation. Perhaps I was arrogant, if so I apologize. It was also not my intent, fwiw. Also, thanks for helping me understand why I was getting downvoted. I intended my post to just be constructive spitballing about what I see as the eventual inevitable weakness in Anubis. I think it’s a great project and it’s great that people are getting use out of it even temporarily, and of course the devs deserve lots of respect for making the thing. But as much as I wish I could like it and believe it will solve the problem, I still don’t think it will.

dabe@lemmy.zip · 5 months ago

Well I can agree on the fact that the arms race situation we’re in sucks. It’s an old problem, seen in malware attacks and defenses. I’m just glad we have people fighting on our side in their spare time :’)

And it’s all good on the tone, thank you for your clarifications

Guillaume Rossolini@infosec.exchange · 6 months ago

@mfed1122 @tofu any client-side tech to avoid (some of the) bots is bound to, as its popularity grows, either be circumvented by the bot’s developers or be solved by the model behind the bot once it has picked up enough

I don’t see how any of these are going to do better than a short term patch

rtxn@lemmy.world · edited 6 months ago

That’s the great thing about Anubis: it’s not client-side. Not entirely anyways. Similar to public key encryption schemes, it exploits the computational complexity of certain functions to solve the challenge. It can’t just say “solved, let me through” because the client has to calculate a number, based on the parameters of the challenge, that fits certain mathematical criteria, and then present it to the server. That’s the “proof of work” component.

A challenge could be something like “find the two prime factors of the semiprime 1522605027922533360535618378132637429718068114961380688657908494580122963258952897654000350692006139”. This number is known as RSA-100, it was first factorized in 1991, which took several days of CPU time, but checking the result is trivial since it’s just integer multiplication. A similar semiprime of 260 decimal digits still hasn’t been factorized to this day. You can’t get around mathematics, no matter how advanced your AI model is.
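
The asymmetry in that example (hard to solve, trivial to check) can be sketched as follows. This is an illustration under stated assumptions, not how Anubis or the RSA-100 factorization actually works: `verify_factors` and `factor_by_trial_division` are hypothetical names, the check does not test that the factors are prime, and trial division is a deliberately naive stand-in for a real factoring algorithm.

```python
from typing import Tuple


def verify_factors(n: int, p: int, q: int) -> bool:
    """Checker side: a single multiplication plus a guard against trivial factors."""
    return p > 1 and q > 1 and p * q == n


def factor_by_trial_division(n: int) -> Tuple[int, int]:
    """Solver side: naive search whose cost grows with the size of the smaller factor."""
    if n < 4:
        raise ValueError("n is too small to have two non-trivial factors")
    if n % 2 == 0:
        return 2, n // 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return d, n // d
        d += 2
    raise ValueError("n is prime")
```

For a toy semiprime such as `10007 * 10009` the search finishes quickly; for a number the size of RSA-100 this naive loop would never finish in practice, while `verify_factors` would still be one multiplication.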

Guillaume Rossolini@infosec.exchange · 6 months ago

@rtxn I don’t understand how that isn’t client side?

Anything that is client side can be, if not spoofed, then at least delegated to a sub process, and my argument stands

Passerby6497@lemmy.world · 6 months ago

Please, explain to us how you expect to spoof a math problem whose answer you have to provide to the server before proceeding.

You can math all you want on the client, but the server isn’t going to give you shit until you provide the right answer.

Guillaume Rossolini@infosec.exchange · 6 months ago

@Passerby6497 I really don’t understand the issue here

If there is a challenge to solve, then the server has provided that to the client

There is no way around this, is there?

Passerby6497@lemmy.world · 6 months ago

You’re given the challenge to solve by the server, yes. But just because the challenge is provided to you, that doesn’t mean you can fake your way through it.

You still have to calculate the answer before you can get any farther. You can’t bullshit/spoof your way through the math problem to bypass it, because your correct answer is required to proceed.

> There is no way around this, is there?

Unless the server gives you a well-known problem whose answer you already have or can easily calculate, or you find a vulnerability in something like Anubis to make it accept a wrong answer, not really. You’re stuck at the interstitial page with a math prompt until you solve it.

Unless I’m misunderstanding your position, I’m not sure what the disconnect is. The original question was about spoofing the challenge client side, but you can’t really spoof the answer to a complicated math problem unless there’s an issue with the server side validation.

Guillaume Rossolini@infosec.exchange · 6 months ago

@Passerby6497 my stance is that the LLM might recognize that the best way to solve the problem is to run chromium and get the answer from there, then pass it on?

Badabinski@kbin.earth · 6 months ago

Anubis has worked if that’s happening. The point is to make it computationally expensive to access a webpage, because that’s a natural rate limiter. It kinda sounds like it needs to be made more computationally expensive, however.

Passerby6497@lemmy.world · 6 months ago

Congrats on doing it the way the website owner wants! You’re now into the content, and you had to waste seconds of processing power to do so (effectively being throttled by the owner), so everyone is happy. You can’t overload the site, but you can still get there after a short wait.

dabe@lemmy.zip · 6 months ago

That solution still introduces lots of friction. At the volume and rate that these bots want to be traversing the internet, they probably don’t want to be fully graphically rendering pages and spawning extra browser processes, then doing text recognition to pass on to the LLM training sets. Maybe I’m wrong there, but I don’t think it’s that simple; it actually just shifts solving the math challenge horizontally (i.e., in both cases, the scraper or the network the scraper is running on still has to solve the challenge)

zalgotext@sh.itjust.works · 6 months ago

LLMs can’t just run chromium unless they’re tool aware and have an agent running alongside them to facilitate tool use. I highly suspect that AI web crawlers aren’t that sophisticated.

rtxn@lemmy.world · edited 6 months ago

It’s not client-side because validation happens on the server side. The content won’t be displayed until and unless the server receives a valid response, and the challenge is formulated in such a way that calculating a valid answer will always take a long time. It can’t be spoofed because the server will know that the answer is bullshit. In my example, the server will know that the prime factors returned by the client are wrong because their product won’t be equal to the original semiprime. Delegating to a sub-process won’t work either, because what’s the parent process supposed to do? Move on to another piece of content that is also protected by Anubis?

The point is to waste the client’s time and thus reduce the number of requests the server has to handle, not to prevent scraping altogether.
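
The solve-on-the-client, validate-on-the-server flow described above can be sketched like this. It is a minimal sketch, assuming a hash-based challenge in which the client must find a nonce such that the SHA-256 digest of challenge plus nonce starts with a given number of zero hex digits. The function names and the challenge format are hypothetical and are not taken from Anubis's code.

```python
import hashlib
import secrets


def make_challenge() -> str:
    """Server: issue a random challenge string (hypothetical format)."""
    return secrets.token_hex(16)


def solve(challenge: str, difficulty: int) -> int:
    """Client: brute-force the smallest nonce whose digest has `difficulty` leading zero hex digits."""
    prefix = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce
        nonce += 1


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server: a single hash is enough to accept or reject the client's answer."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)
```

The client cannot skip `solve`, because any nonce it makes up is rejected by `verify` unless the digest really has the required prefix. The server's cost is one hash per request, however long the client had to search.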

Guillaume Rossolini@infosec.exchange · 6 months ago

@rtxn validation of what?

This is a typical network thing: client asks for resource, server says here’s a challenge, client responds or doesn’t, has the correct response or not, but has the challenge regardless

rtxn@lemmy.world · 6 months ago

THEN (and this is the part you don’t seem to understand) the client process has to waste time solving the challenge, which is, by the way, orders of magnitude lighter on the server than serving the actual meaningful content, or cancel the request. If a new request is sent during that time, it will still have to waste time solving the challenge. The scraper will get through eventually, but the challenge delays the response and reduces the load on the server because while the scrapers are busy computing, it doesn’t have to serve meaningful content to them.

Guillaume Rossolini@infosec.exchange · 6 months ago

@rtxn all right, that’s all you had to say initially, rather than try convincing me that the network client was out of the loop: it isn’t, that’s the whole point of Anubis

rtxn@lemmy.world · edited 6 months ago

With how much authority you wrote with before, I thought you’d be able to grasp the concept. I’m sorry I assumed better.

mfed1122@discuss.tchncs.de · 5 months ago

Yeah, you’re absolutely right and I agree. So then do we have to resign the situation to being an eternal back-and-forth of just developing random new challenges every time the scrapers adapt to them? Like antibiotics for viruses? Maybe that is the way it is. And honestly that’s what I suspect. But Anubis feels so clever and so close to something that would work. The concept of making it about a cost that adds up, so that it intrinsically only affects massive processes significantly, is really smart…since it’s not about coming up with a challenge a computer can’t complete, but just a challenge that makes it economically not worth it to complete. But it’s disappointing to see that, at least with the current wait times, it doesn’t seem like it will cost enough to dissuade scrapers. And worse, the cost is so low that it seems like making the cost significant to the scrapers will require really insufferable wait times for users.

Guillaume Rossolini@infosec.exchange · 5 months ago

@mfed1122 yeah that is my worry, what’s an acceptable wait time for users? A tenth of a second is usually not noticeable to a human, but is it useful in this context? What about half a second, etc

I don’t know that I want a web where everything is artificially slowed by a full second for each document
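
The trade-off between user wait time and scraper cost can be put into rough numbers. This is a back-of-the-envelope sketch, assuming a hash-based challenge that requires `difficulty` leading zero hex digits, so that each attempt succeeds with probability 1/16^difficulty and the expected number of attempts is 16^difficulty. The function names are hypothetical, and any hash rate plugged in is a placeholder rather than a measurement.

```python
def expected_attempts(difficulty: int) -> int:
    """Mean number of hashes needed for `difficulty` leading zero hex digits."""
    return 16 ** difficulty


def expected_seconds(difficulty: int, hashes_per_second: float) -> float:
    """Mean wait for a solver running at the given (assumed) hash rate."""
    return expected_attempts(difficulty) / hashes_per_second


def max_difficulty_for_wait(max_seconds: float, hashes_per_second: float) -> int:
    """Largest difficulty whose mean wait stays within `max_seconds`."""
    difficulty = 0
    while expected_seconds(difficulty + 1, hashes_per_second) <= max_seconds:
        difficulty += 1
    return difficulty
```

Because each extra hex digit multiplies the work by 16, the tolerable difficulty is set by the slowest device a site wants to support, and a scraper with faster hardware pays proportionally less time per page than that device does. Feeding real measured hash rates into `expected_seconds` would be needed before drawing any firm conclusion.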