Large language models (LLMs) have shown significant progress in reasoning tasks. However, recent studies show that transformers and LLMs fail catastrophically once reasoning problems exceed modest complexity. We revisit these findings through the lens of large reasoning models (LRMs) -- LLMs fine-tuned with incentives for step-by-step argumentation and self-verification. LRM performance on graph and reasoning benchmarks such as NLGraph seems extraordinary, with some even claiming they are capable of generalized reasoning and innovation in reasoning-intensive fields such as mathematics, physics, medicine, and law. However, by more carefully scaling the complexity of reasoning problems, we show that existing benchmarks actually have limited complexity. We develop a new dataset, the Deep Reasoning Dataset (DeepRD), along with a generative process for producing unlimited examples of scalable complexity. We use this dataset to evaluate model performance on graph connectivity and natural language proof planning. We find that the performance of LRMs drops abruptly at sufficient complexity and does not generalize. We also relate our LRM results to the distributions of the complexities of large, real-world knowledge graphs, interaction graphs, and proof datasets. We find that the majority of real-world examples fall inside the LRMs' success regime, yet the long tails expose substantial failure potential. Our analysis highlights the near-term utility of LRMs while underscoring the need for new methods that generalize beyond the complexity of examples in the training distribution.


I don’t want to be that guy, but… no, wait, I am that guy.
No current model reasons. Not even “reasoning” models - the name is just yet another misleading analogy¹, meant to make you believe they have better capabilities than they do.
At the end of the day, what they do is a more complex version of predicting what they should output as the next chunk of text, based on what is present in the data they processed (were “trained” with) plus some weighting². This is good enough to emulate reasoning in some cases, but it is still not reasoning.
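To make “predicting the next chunk” concrete, here's a deliberately dumb toy in Python. It is nothing like how a real LLM works internally (no neural network, no attention, just word-pair counts), but it illustrates the basic idea of continuing text from patterns in the “training” data rather than from reasoning:

```python
# Toy next-word "predictor": counts which word followed which in the training
# text and always emits the most frequent continuation. This is NOT how real
# LLMs are implemented; it only illustrates "pattern continuation, not reasoning".
from collections import Counter, defaultdict

def train(text):
    follows = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1
    return follows

def predict_next(follows, word):
    if word not in follows:
        return None  # never seen in training: no basis for an answer
    return follows[word].most_common(1)[0][0]

model = train("two plus two is four . three plus three is six .")
print(predict_next(model, "plus"))   # 'two' or 'three', whichever pattern dominates
print(predict_next(model, "seven"))  # None: outside the training distribution
```

When the input looks like the training data, it produces something plausible; when it doesn't, there is nothing to fall back on.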
Some muppet might say “ackshyually emulating it and having it is the same thing lol”, or “I don’t know if I’m just emulating reasoning lmao”, as if the issue was just mental masturbation. Not really, it’s a practical matter - reasoning is a requirement to reliably reach correct conclusions based on correct premises, and the emulation is not perfect, so where the emulation breaks the results become unreliable. In other words: the model will babble nonsense³ where the emulation fails.
For example, consider multiplications. If you correctly follow the reasoning behind multiplication, it doesn’t really matter if you’re multiplying numbers with two, 20 or even 2000 digits each - you’ll consistently reach the right result. However, if you’re simply emulating the reasoning behind multiplication, you’ll reach a point where the multiplications start failing³.
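To illustrate the first half of that, here's a rough Python sketch of actually following the procedure: plain grade-school long multiplication, with digits stored least-significant first. Follow the steps mechanically and the number of digits is irrelevant; only the running time grows, not the error rate:

```python
# Grade-school long multiplication on decimal strings. The same procedure
# works unchanged for 2, 20 or 2000 digits.
def long_multiply(a: str, b: str) -> str:
    da = [int(c) for c in reversed(a)]          # least-significant digit first
    db = [int(c) for c in reversed(b)]
    result = [0] * (len(da) + len(db))
    for i, x in enumerate(da):
        carry = 0
        for j, y in enumerate(db):
            total = result[i + j] + x * y + carry
            result[i + j] = total % 10
            carry = total // 10
        result[i + len(db)] += carry            # leftover carry of this row
    while len(result) > 1 and result[-1] == 0:  # strip leading zeros
        result.pop()
    return "".join(str(d) for d in reversed(result))

# Check against Python's built-in bignum arithmetic on 2000-digit operands.
import random
x = "".join(random.choice("0123456789") for _ in range(2000)).lstrip("0") or "0"
y = "".join(random.choice("0123456789") for _ in range(2000)).lstrip("0") or "0"
assert long_multiply(x, y) == str(int(x) * int(y))
```

An emulation of the pattern has no such guarantee - it only has to produce something that looks like the examples it absorbed.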
Now, check the article. It’s pretty much a generalisation of my example above; instead of talking about multiplications, it’s talking about reasoning in general, as applied to tasks such as graph connectivity, counting asteroids, and others.