Large language models (LLMs) have shown significant progress in reasoning tasks. However, recent studies show that transformers and LLMs fail catastrophically once reasoning problems exceed modest complexity. We revisit these findings through the lens of large reasoning models (LRMs) -- LLMs fine-tuned with incentives for step-by-step argumentation and self-verification. LRM performance on graph and reasoning benchmarks such as NLGraph seems extraordinary, with some even claiming they are capable of generalized reasoning and innovation in reasoning-intensive fields such as mathematics, physics, medicine, and law. However, by more carefully scaling the complexity of reasoning problems, we show that existing benchmarks actually have limited complexity. We develop a new dataset, the Deep Reasoning Dataset (DeepRD), along with a generative process for producing unlimited examples of scalable complexity. We use this dataset to evaluate model performance on graph connectivity and natural language proof planning. We find that the performance of LRMs drops abruptly at sufficient complexity and does not generalize. We also relate our LRM results to the distributions of the complexities of large, real-world knowledge graphs, interaction graphs, and proof datasets. We find that the majority of real-world examples fall inside the LRMs' success regime, yet the long tails expose substantial failure potential. Our analysis highlights the near-term utility of LRMs while underscoring the need for new methods that generalize beyond the complexity of examples in the training distribution.


I don’t want to be that guy, but… no, wait, I am that guy.
No current model reasons. Not even “reasoning” models - the name is just yet another misleading analogy¹, meant to make you believe they have better capabilities than they do.
At the end of the day, what they do is a more complex version of predicting what they should output as the next chunk of text, based on what is present in the data they processed (were “trained” with) plus some weighting². This is good enough to emulate reasoning in some cases, but it is still not reasoning.
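To make “predicting the next chunk” concrete, here's a deliberately dumb toy in Python. It is nothing like how a real LLM works internally (no neural network, no attention, just word-pair counts), but it illustrates the basic idea of continuing text from patterns in the “training” data rather than from reasoning:

```python
# Toy next-word "predictor": counts which word followed which in the training
# text and always emits the most frequent continuation. This is NOT how real
# LLMs are implemented; it only illustrates "pattern continuation, not reasoning".
from collections import Counter, defaultdict

def train(text):
    follows = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1
    return follows

def predict_next(follows, word):
    if word not in follows:
        return None  # never seen in training: no basis for an answer
    return follows[word].most_common(1)[0][0]

model = train("two plus two is four . three plus three is six .")
print(predict_next(model, "plus"))   # 'two' or 'three', whichever pattern dominates
print(predict_next(model, "seven"))  # None: outside the training distribution
```

When the input looks like the training data, it produces something plausible; when it doesn't, there is nothing to fall back on.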
Some muppet might say “ackshyually emulating it and having it is the same thing lol”, or “I don’t know if I’m just emulating reasoning lmao”, as if the issue was just mental masturbation. Not really, it’s a practical matter - reasoning is a requirement to reliably reach correct conclusions based on correct premises, and the emulation is not perfect, so where the emulation breaks the results become unreliable. In other words: the model will babble nonsense³ where the emulation fails.
For example, consider multiplications. If you correctly follow the reasoning behind multiplication, it doesn’t really matter if you’re multiplying numbers with two, 20 or even 2000 digits each - you’ll consistently reach the right result. However, if you’re simply emulating the reasoning behind multiplication, you’ll reach a point where the multiplications start failing³.
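To illustrate the first half of that, here's a rough Python sketch of actually following the procedure: plain grade-school long multiplication, with digits stored least-significant first. Follow the steps mechanically and the number of digits is irrelevant; only the running time grows, not the error rate:

```python
# Grade-school long multiplication on decimal strings. The same procedure
# works unchanged for 2, 20 or 2000 digits.
def long_multiply(a: str, b: str) -> str:
    da = [int(c) for c in reversed(a)]          # least-significant digit first
    db = [int(c) for c in reversed(b)]
    result = [0] * (len(da) + len(db))
    for i, x in enumerate(da):
        carry = 0
        for j, y in enumerate(db):
            total = result[i + j] + x * y + carry
            result[i + j] = total % 10
            carry = total // 10
        result[i + len(db)] += carry            # leftover carry of this row
    while len(result) > 1 and result[-1] == 0:  # strip leading zeros
        result.pop()
    return "".join(str(d) for d in reversed(result))

# Check against Python's built-in bignum arithmetic on 2000-digit operands.
import random
x = "".join(random.choice("0123456789") for _ in range(2000)).lstrip("0") or "0"
y = "".join(random.choice("0123456789") for _ in range(2000)).lstrip("0") or "0"
assert long_multiply(x, y) == str(int(x) * int(y))
```

An emulation of the pattern has no such guarantee - it only has to produce something that looks like the examples it absorbed.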
Now, check the article. It’s pretty much a generalisation of my example above; instead of talking about multiplications, it’s talking about reasoning in general, as applied to tasks such as graph connectivity, counting asteroids, and others.