Does Skynet dream of electric sheep?

14/10/2025

Over the past four years, we’ve witnessed dizzying progress in large language models (LLMs). For many, including myself, the most striking development has been the rise of models that “think before they answer,” capable of creating a kind of intermediate path where they break down the steps needed to solve a task. The idea is intriguing: if the model stops to “think,” perhaps it will come closer to how we humans reason (and help us better understand the workings of our own minds). But let’s not get ahead of ourselves. What these LLMs do well, they do very well. However, it’s just as important to understand what they can’t yet do, so we’re not fooled.

This summer, an Apple research paper [1] sparked a fascinating discussion. The authors argued that once a certain threshold of difficulty was reached, LLMs not only started to fail more, but also reduced their “reasoning effort” even when they still had computing power to spare. In their experiments, when confronted with the Towers of Hanoi puzzle with multiple disks, the models seemed to “throw in the towel” too early. The researchers called this behavior “anticipatory fatigue.” A detailed analysis by independent reviewers revealed that part of this effect could be explained by the way the tests were designed: tasks with a large number of steps, where a single mistake creates a domino effect leading to further errors, and tasks whose memory demands exceeded the context window the tested LLMs could handle at once. In that context, giving up early wasn’t really an “emotion,” but rather a practical consequence: it’s hard to keep trying to reach the finish line if you know you’re going to get lost along the way.

Methodological details aside, Apple’s paper had a tremendously positive impact: it pushed the community to measure these “think aloud” strategies more carefully and explore how to make them more reliable. Since then, a wide range of approaches have emerged. Some are inspired by human problem-solving habits: breaking a task down into smaller sub-steps, comparing one idea with another, or seeking a second opinion before making a decision. Others explore less intuitive routes, such as reinforcement training, which rewards useful behaviors even if they can’t easily be explained in words. We have OpenAI [2] and Anthropic [3] working on more “human-like” approaches, and DeepSeek trying different shortcuts [4].

One useful takeaway from the research published over the summer is that, when a model tries to reason through very long chains, the chance of small mistakes snowballing into larger ones increases. Sometimes this happens because the system latches onto an explanation that sounds good and stops considering alternatives, developing a kind of “tunnel vision” [5]. Other times, it loses confidence in its own train of “thought” [6], contradicts itself, and ultimately chooses an inferior course of action. To counter this phenomenon, researchers have tried having the model generate several “lines of thought” at the same time and then stick with the most promising one [7], or use a weighted combination. These mechanisms improve results in many cases, though at a higher computational cost.
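To make that idea a little more concrete, here is a minimal Python sketch of the “several lines of thought” strategy: sample a few independent reasoning chains and combine their answers with a weighted vote. The functions generate_chain and score_chain are hypothetical placeholders standing in for a model call and a confidence (or verifier) score, not a real API, and trained aggregation approaches such as [7] are considerably more elaborate.

# Minimal sketch of sampling several "lines of thought" and combining them
# with a weighted vote. generate_chain and score_chain are placeholders,
# not a real LLM API.

import random
from collections import defaultdict

def generate_chain(question: str, seed: int) -> tuple[str, str]:
    """Placeholder: returns (reasoning_chain, final_answer) for one sampled line of thought."""
    random.seed(seed)
    answer = random.choice(["A", "B"])  # stand-in for the model's final answer
    return f"step-by-step reasoning #{seed}", answer

def score_chain(chain: str) -> float:
    """Placeholder: a confidence or verifier score for one reasoning chain."""
    return random.uniform(0.0, 1.0)

def aggregate_answer(question: str, n_samples: int = 8) -> str:
    """Sample several chains and combine their answers with a weighted vote."""
    votes: dict[str, float] = defaultdict(float)
    for seed in range(n_samples):
        chain, answer = generate_chain(question, seed)
        votes[answer] += score_chain(chain)  # weight each vote by the chain's score
    # Keep the answer with the highest accumulated weight.
    return max(votes, key=votes.get)

if __name__ == "__main__":
    print(aggregate_answer("Which option is correct, A or B?"))

The point of the sketch is simply that the extra robustness comes from paying for several samples instead of one, which is exactly the higher computational cost mentioned above.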
This scientific exploration has also helped us better understand another well-known problem: how easily models “make things up.” They don’t do this maliciously. They’ve simply learned that, in the absence of data, completing a sentence with something plausible is often better than remaining silent [8] (the technical term for the phenomenon is “hallucination,” and it poses a serious computer security risk [9]). That’s precisely why researchers are now working on how to teach models to question themselves, recognize errors, and verify before making claims.

But the metaphor in this post’s title points to something deeper. Today, what we call “reasoning” in these systems is still a very limited and superficial simulation, as we’ll see in the next post in this series. Models excel when they can reuse patterns they’ve seen thousands of times during their training. If we change the names of the pieces of a problem, if we introduce irrelevant and distracting data, or if we ask them to extrapolate to new structures, their performance suffers significantly. In my view, the label “reasoning models” is a potentially misleading marketing term, since current AIs do not reason like humans. The evidence shows that current models struggle not only with symbolic reasoning but also with structural generalization. They do not construct new meanings or verify information intentionally. They operate on correlations, not understanding. They simulate reasoning; they don’t experience it. Other researchers, including Turing Award laureate Yann LeCun, have discussed these aspects at length. LLMs do something different, valuable in many contexts, but different.

This doesn’t take away from the enormous value of the progress made, which is real and highly useful. But it’s precisely why we must maintain a critical, honest perspective. If we get caught up in the hype of grandiose headlines, we may mistake effective pattern reuse for deep thought. And they’re not the same. Pattern reuse is incredibly useful for working faster and better on repetitive tasks or tasks with familiar structures. Deep thought, on the other hand, arises when new paths have to be explored, options have to be discarded, results have to be checked and, sometimes, the approach has to be changed on the fly. For an automated system to reliably approach that level of intelligence, it needs to be able to do more than string sentences together: it needs to explore, verify, and learn in the process itself.

Before we delegate critical decisions to these models (in autonomous driving, cybersecurity, systems engineering, or even policymaking [10]), it’s worth asking two simple questions. First: is this problem sufficiently similar to others the LLM has already seen, or are we asking it to navigate uncharted territory? Second: has there been any independent verification of what it’s proposing, or is this simply an answer that sounds convincing? Answering these questions honestly will avoid misunderstandings and lead to better decision-making.

And that’s the idea I’d like to explore in this series of posts. In the next installment, we’ll see why many LLM success stories are, above all, brilliant cases of pattern reuse. In the third, we’ll discuss something that concerns me greatly: how our own behavior changes when we work with these tools and how much they influence the way we think and solve problems.

In this series, I want to make AI more accessible to people, taking a useful, responsible, and transparent approach.
My goal with this outreach is to share what I learn from the projects I’m working on at GMV in order to help people distinguish between expectations and reality, so that anyone (from students to professionals) can make better decisions. I’d love for you to follow this series, share your questions, and join the conversation.

Author: David Miraut

REFERENCES:

[1] P. Shojaee, I. Mirzadeh, K. Alizadeh, M. Horton, S. Bengio, and M. Farajtabar, “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity,” July 18, 2025, arXiv: arXiv:2506.06941. doi: 10.48550/arXiv.2506.06941.

[2] A. El-Kishky et al., “Learning to Reason with LLMs,” OpenAI, September 2024. Accessed: September 25, 2025. [Online]. Available at: https://openai.com/index/learning-to-reason-with-llms/

[3] Y. Chen et al., “Reasoning Models Don’t Always Say What They Think,” May 8, 2025, arXiv: arXiv:2505.05410. doi: 10.48550/arXiv.2505.05410.

[4] D. Guo et al., “DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning,” Nature, vol. 645, no. 8081, pp. 633-638, Sept. 2025. doi: 10.1038/s41586-025-09422-z.

[5] H. Wen et al., “ParaThinker: Native Parallel Thinking as a New Paradigm to Scale LLM Test-time Compute,” August 30, 2025, arXiv: arXiv:2509.04475. doi: 10.48550/arXiv.2509.04475.

[6] A. Sinha, A. Arun, S. Goel, S. Staab, and J. Geiping, “The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs,” September 11, 2025, arXiv: arXiv:2509.09677. doi: 10.48550/arXiv.2509.09677.

[7] W. Zhao, P. Aggarwal, S. Saha, A. Celikyilmaz, J. Weston, and I. Kulikov, “The Majority is not always right: RL training for solution aggregation,” September 8, 2025, arXiv: arXiv:2509.06870. doi: 10.48550/arXiv.2509.06870.

[8] A. T. Kalai, O. Nachum, S. S. Vempala, and E. Zhang, “Why Language Models Hallucinate,” September 4, 2025, arXiv: arXiv:2509.04664. doi: 10.48550/arXiv.2509.04664.

[9] D. Miraut, “Slopsquatting: A silent threat born from the hallucinations of LLMs,” GMV Blog. Accessed: September 25, 2025. [Online]. Available at: https://www.gmv.com/en-es/media/blog/cybersecurity/slopsquatting-silent-threat-born-hallucinations-llms

[10] RTVE.es, “El Gobierno de Albania nombra a una ‘ministra’ creada con Inteligencia Artificial para acabar con la corrupción” [“The Government of Albania appoints an AI-created ‘minister’ to crack down on corruption”], RTVE.es. Accessed: September 25, 2025. [Online]. Available at: https://www.rtve.es/noticias/20250912/gobierno-albania-nombra-ministra-creada-inteligencia-artificial-acabar-corrupcion/16726028.shtml