Journey to the center of simulated reasoning

28/01/2026

In the first post in this series we argued for a simple idea: large language models (LLMs) shine when they can recognize and recombine patterns they have already seen, and they falter when we ask them to find a new rule in unfamiliar terrain. This second post digs a little deeper into those ideas to understand the limits and challenges of a future Artificial General Intelligence (AGI), which means understanding what exactly "thinking" means for an AI and why it sometimes appears to be thinking today.

To give a simple answer, let's consider two ways of solving problems. One consists of recognizing a pattern we have already studied by heart, as when we read a recipe and follow it step by step without improvising. The other means exploring, trial and error and, with any luck, finding a new rule that saves us time. We will therefore call reuse the ability to recognize the shape of the problem from known patterns, fill in the gaps and give a coherent answer; and discovery the ability to explore alternatives, discard judiciously, verify intermediate steps and, hopefully, find a new rule that makes the problem manageable. The former explains many of AI's day-to-day successes; the latter is what we associate with "thinking something through."

It is not always easy to distinguish between the two when we focus only on the result. An LLM may get it right out of sheer familiarity with similar exercises, by fitting pieces together efficiently. From the outside it may appear to have "reasoned" in some deep way. So how do we know whether a model is "discovering" when it responds and not just "remembering"? The temptation is to use lists of difficult questions and measure successes, but that way of evaluating can be misleading. If the exam looks too similar to what is already circulating online (which is how many LLMs are trained), the model may shine because of what it has memorized, not because of any "mental" exploration. It therefore makes sense to assess not only whether it gets it right, but how it arrives at the answer.

The idea is simple and very human: if we want to know whether someone really knows, we don't give them the problem they have already worked through, but a similar one with a twist, and we watch how they manage. It is the same with AI; what we are looking for is not just a right answer, but signs of a good process: that the system discards judiciously, that it checks what it affirms, and that it corrects itself when it stumbles. For this reason, evaluations based on interactive and changing environments, which reduce the advantage of memorizing and force the system to adapt on the fly, have gained importance in recent months. One example is ARC-AGI-3, which builds completely new small worlds on the fly and measures how efficiently an agent acquires skills in scenarios it has not seen before [1]. GAMEBoT also evaluates decisions made during games and makes the intermediate process public, which helps to separate real strategy from one-off successes [2].

At the same time, as we noted in the previous post, simple ways to "think" better while generating answers have been proposed. One of the most accessible is to ask the model for several answers at once and keep the most consistent one. This technique, known as self-consistency, rests on a reasonable principle: the same correct answer can be reached by different routes, and comparing several routes reduces chance errors and improves the reliability of the result [3].
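To make the idea concrete, here is a minimal sketch in Python. It is illustrative only: sample_answer is a hypothetical placeholder for whatever model call you use, assumed to return the final answer from one independently sampled attempt; it does not correspond to any particular library's API.

    # Minimal self-consistency sketch (illustrative, not a specific library's API).
    from collections import Counter

    def sample_answer(question: str) -> str:
        # Placeholder: call your model here with a non-zero temperature so that
        # each call can follow a different route to its final answer.
        raise NotImplementedError

    def self_consistent_answer(question: str, n_samples: int = 10) -> str:
        # Draw several independent attempts and keep the answer that appears most
        # often: agreement across different routes acts as a proxy for reliability.
        answers = [sample_answer(question) for _ in range(n_samples)]
        most_common_answer, _count = Counter(answers).most_common(1)[0]
        return most_common_answer

The point is not the code itself but the shape of the trick: reliability is bought by spending more samples, not by any deeper understanding on the model's part.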
Another approach, the Tree of Thoughts, proposes not following a single thread but exploring branches, backtracking when something does not add up and consolidating what works, as any of us would do with a pencil-and-paper problem [4]. In tasks where the space of options grows rapidly, some proposals integrate Monte Carlo tree search [5] to guide this exploration in a more orderly and verifiable way, rather than blindly multiplying trains of thought [6,7].

It is precisely in fast-growing problems that the difference between repetition and discovery shows up quickly. Imagine a maze that forks at every corner into three or four new corridors: with two turns it is still manageable; with ten it starts to become overwhelming (at four corridors per corner that is already around a million possible routes); with twenty it is a jungle where the trees no longer let us see the path. People survive thanks to "shortcuts": we discard corridors by intuition, we look for signs to guide us, we remember patterns from other mazes. An automatic system also needs these "shortcuts" to keep from getting lost. When the model gets it right, it is sometimes not because it has explored everything, but because someone else already discovered how to narrow down the possibilities and select the routes, and that trick was "saved" in its parameters. Therein lies much of the practical magic of LLMs: leveraging good ideas from the past at breakneck speed.

In a more advanced form, proposals are emerging to adapt the model at inference time when the context demands it. Under the name of test-time learning, several research initiatives have shown that it is possible to slightly adjust the behavior of these systems using the very data they analyze while they run, with no human labels and with localized changes. This brings significant improvements when there is a mismatch between what the model knows and what the case requires. The idea is to "learn a little while you respond", with mechanisms designed not to forget what was already working [8,9]. Another line combines generation and verification during inference itself: instead of voting among many chains of reasoning, a verifier is trained or guided to help decide better within the same computing budget [10,11].
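As a rough illustration of that generate-and-verify pattern, here is another small sketch. Again, propose and score are hypothetical placeholders rather than any particular system's API: the first samples one candidate solution, the second stands in for a verifier (another model, a rule-based checker, a test suite) that estimates how trustworthy the candidate is.

    # Minimal generate-and-verify sketch (illustrative only).
    def propose(problem: str) -> str:
        # Placeholder: sample one candidate solution from the generator model.
        raise NotImplementedError

    def score(problem: str, candidate: str) -> float:
        # Placeholder: return a higher value the more the verifier trusts the candidate.
        raise NotImplementedError

    def best_of_n(problem: str, n_candidates: int = 8) -> str:
        # Spend the inference budget on a handful of candidates and let the
        # verifier, rather than a simple vote, decide which one is returned.
        candidates = [propose(problem) for _ in range(n_candidates)]
        return max(candidates, key=lambda c: score(problem, c))

Like self-consistency, this buys care with extra computation, which leads directly to the practical question of cost.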
There are also practical limits that should not be overlooked. Exploring in depth takes time and energy, in the literal sense. If we ask a system to consider many hypotheses, verify sub-steps and keep a memory of what it has already tried, we are consuming more computing resources. That is not dramatic; it is a design decision. If the problem is worth it (a medical diagnosis, a complex mission path, a security audit), we may want to "pay" for that exploration and for a more careful process. If the problem is an everyday, frequent one (an email, a summary, a translation), it may be wiser to stick with the swiftness of pattern reuse. This is not about expecting miracles, but about understanding what is really going on below the surface in order to choose the right tool for each task: knowing when a fast tool suffices and when we need one that truly explores.

In the spirit of the first post, we would say that it makes no sense to ask a hammer to do the work of a screwdriver; nor does it make sense to sell a hammer as if it were an infinite tool kit. With language models, the responsible thing to do is to recognize their value as accelerators of tasks with known structure and, at the same time, to recognize that in-depth thinking requires exploring, verifying and sometimes learning something new during the process itself. To call this "reasoning" without nuance is confusing; to deny it completely would also be wrong. I think the virtue lies in being clear about our expectations.

In my work at GMV I try to apply this criterion in real projects. Part of my job is to help AI evolve and remain useful and responsible in sectors where the quality bar is very high. In this article I am sharing what I have learned so that anyone, from beginners to those who have been at it for years, can distinguish between apparent brilliance and real progress. If this series of posts has helped you look at AI a little more calmly in a world overloaded with hype, it has done its job; in the third installment I'll talk about us: how our own behavior changes when we work with these tools, and why that change deserves as much attention as the algorithms.

Author: David Miruat

REFERENCES:

[1] "ARC-AGI-3", ARC Prize. Accessed: September 25, 2025. [Online]. Available at: https://arcprize.org/arc-agi/3/
[2] W. Lin, J. Roberts, Y. Yang, S. Albanie, Z. Lu, and K. Han, "GAMEBoT: Transparent Assessment of LLM Reasoning in Games", in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds., Vienna, Austria: Association for Computational Linguistics, Jul. 2025, pp. 7656-7682. doi: 10.18653/v1/2025.acl-long.378.
[3] X. Wang et al., "Self-Consistency Improves Chain of Thought Reasoning in Language Models", March 7, 2023, arXiv: arXiv:2203.11171. doi: 10.48550/arXiv.2203.11171.
[4] S. Yao et al., "Tree of Thoughts: Deliberate Problem Solving with Large Language Models", December 3, 2023, arXiv: arXiv:2305.10601. doi: 10.48550/arXiv.2305.10601.
[5] "Árbol de búsqueda Monte Carlo", Wikipedia. January 21, 2021. Accessed: September 25, 2025. [Online]. Available at: https://es.wikipedia.org/w/index.php?title=%C3%81rbol_de_b%C3%BAsqueda_Monte_Carlo&oldid=132591437
[6] Y. Xie et al., "Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning", June 17, 2024, arXiv: arXiv:2405.00451. doi: 10.48550/arXiv.2405.00451.
[7] Z. Gao et al., "Interpretable Contrastive Monte Carlo Tree Search Reasoning", December 25, 2024, arXiv: arXiv:2410.01707. doi: 10.48550/arXiv.2410.01707.
[8] J. Hu et al., "Test-Time Learning for Large Language Models", May 27, 2025, arXiv: arXiv:2505.20633. doi: 10.48550/arXiv.2505.20633.
[9] Y. Sun, X. Wang, Z. Liu, J. Miller, A. Efros, and M. Hardt, "Test-Time Training with Self-Supervision for Generalization under Distribution Shifts", in Proceedings of the 37th International Conference on Machine Learning, PMLR, Nov. 2020, pp. 9229-9248. Accessed: September 25, 2025. [Online]. Available at: https://proceedings.mlr.press/v119/sun20b.html
[10] Z. Liang, Y. Liu, T. Niu, X. Zhang, Y. Zhou, and S. Yavuz, "Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification", October 5, 2024, arXiv: arXiv:2410.05318. doi: 10.48550/arXiv.2410.05318.
[11] J. Qi, H. Tang, and Z. Zhu, "VerifierQ: Enhancing LLM Test Time Compute with Q-Learning-based Verifiers", October 10, 2024, arXiv: arXiv:2410.08048. doi: 10.48550/arXiv.2410.08048.