The Quiet Revolution in Robot Brains: How a New ‘Thinking’ Architecture Could Make Machines Genuinely Smarter

A team of researchers from Tsinghua University, Shanghai Qi Zhi Institute, and Shanghai AI Laboratory has introduced something that sounds almost paradoxical: a robot that reasons before it acts. Not in the way current systems process sensor data through neural networks. Something different. Something that borrows the chain-of-thought reasoning now common in large language models and applies it, for the first time in a structured way, to the physical control of robots navigating the real world.

The paper, titled “Thinking Before Acting: Exploring Reasoning-Action Connection in Vision-Language-Action Models,” was posted to arXiv in late April 2025 and lays out an architecture the authors call ThinkAct. Its central claim is bold but carefully supported: by forcing a vision-language-action model to generate explicit reasoning traces before producing motor commands, you get a robot that generalizes better, follows instructions more accurately, and recovers from novel situations that would stump conventional systems.

This matters enormously for the robotics industry. And it matters right now.

Why Reasoning Has Been the Missing Piece in Robotic Manipulation

For years, the dominant approach to training robots for manipulation tasks — picking up objects, opening drawers, stacking blocks — has relied on end-to-end learning. Feed the model camera images and a language instruction like “put the red block on the blue block,” and train it to directly output joint torques or end-effector positions. The approach works. It works well enough that companies like Google DeepMind, with their RT-2 model, and startups across the Bay Area have built impressive demonstrations around it.

But there’s a persistent problem. These models are brittle. Move the red block to an unfamiliar position on the table. Change the lighting. Swap in a slightly different gripper. Performance degrades, sometimes catastrophically. The robot doesn’t understand what it’s doing. It has learned a statistical mapping from pixels to actions, and when the statistics shift, the mapping breaks.

The ThinkAct team argues this brittleness stems from a fundamental architectural deficit: the absence of intermediate reasoning. In large language models, chain-of-thought prompting — where the model is encouraged to “show its work” before answering — has proven remarkably effective at improving accuracy on math problems, logic puzzles, and complex question-answering tasks. OpenAI’s o1 model and subsequent reasoning-focused systems have demonstrated that thinking tokens aren’t just explanatory fluff. They change the quality of the output.

The researchers asked a simple question: what if robots did the same thing?

Their answer is ThinkAct, a framework built on top of a vision-language-action (VLA) model that introduces what they call a “reasoning-action connection.” Before the model generates any physical action, it first produces a natural-language reasoning trace. This trace describes the current state of the scene, identifies the relevant objects, articulates the goal, and plans the next step. Only then does the model output an action.

This isn’t just appending a text generation module to an existing robot controller. The architecture is trained end-to-end so that the reasoning and action generation are tightly coupled. The quality of the reasoning directly influences the quality of the action, and vice versa — the requirement to produce good actions pressures the model to reason well.
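The think-then-act ordering described above can be illustrated with a small toy sketch. Everything in it is an assumption made for illustration: the class name ToyThinkThenActPolicy, the scene dictionary, and the three-number action format are hypothetical, and hand-written rules stand in for the learned network. It is not the authors' implementation. It only shows the control flow, in which a reasoning trace is produced first and the action is then conditioned on that trace.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PolicyOutput:
    reasoning: str       # natural-language trace, produced first
    action: List[float]  # hypothetical end-effector delta (dx, dy, dz)


class ToyThinkThenActPolicy:
    """Toy stand-in for a VLA model; simple rules replace the learned network."""

    def reason(self, scene: Dict[str, object], instruction: str) -> str:
        # Describe the scene, restate the goal, and plan the next step.
        grip = "open" if scene["gripper_open"] else "closed"
        return (
            f"The gripper is {grip} and positioned above the {scene['target']}. "
            f"Goal: {instruction}. Next step: lower the gripper to grasp it."
        )

    def act(self, scene: Dict[str, object], instruction: str, reasoning: str) -> List[float]:
        # The action depends on the reasoning text, not only on the raw inputs.
        if "lower the gripper" in reasoning:
            return [0.0, 0.0, -0.05]
        return [0.0, 0.0, 0.0]

    def step(self, scene: Dict[str, object], instruction: str) -> PolicyOutput:
        reasoning = self.reason(scene, instruction)        # think
        action = self.act(scene, instruction, reasoning)   # then act
        return PolicyOutput(reasoning, action)


policy = ToyThinkThenActPolicy()
out = policy.step(
    {"gripper_open": True, "target": "red block"},
    "put the red block on the blue block",
)
print(out.reasoning)
print(out.action)
```

In the architecture the article describes, both stages come from one model trained end-to-end. The two methods here are separated only to make the ordering visible.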

The results, evaluated on the CALVIN benchmark and in real-world experiments, are striking. ThinkAct achieves state-of-the-art performance on long-horizon manipulation tasks, outperforming prior VLA models on sequences of five chained instructions. On the hardest evaluation setting — where the robot must complete all five tasks in succession without failure — ThinkAct improved over the previous best by a significant margin. More importantly, the model showed strong generalization to unseen object configurations and instructions that weren’t present in the training data.

A few numbers stand out. On the CALVIN ABC→D benchmark, which tests generalization to a new environment, ThinkAct achieved an average task completion length of 4.01 out of 5, compared to 3.48 for the next-best model. That gap might sound small. It isn’t. In chained task execution, errors compound — a small improvement in per-step reliability translates to a large improvement in overall success rate.
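The compounding argument can be made concrete with a rough back-of-the-envelope model. The sketch below assumes, purely for illustration, that every task in a chain succeeds independently with the same probability p, so the expected number of tasks completed in a row is p + p^2 + ... + p^5. Real rollouts are unlikely to satisfy this assumption, and the function names are hypothetical. The code inverts that formula by bisection to find the per-step rate implied by each average length quoted above.

```python
def expected_chain_length(p: float, n_tasks: int = 5) -> float:
    """Expected number of consecutive successes if each task succeeds with probability p."""
    return sum(p ** k for k in range(1, n_tasks + 1))


def per_step_rate(avg_len: float, n_tasks: int = 5, tol: float = 1e-9) -> float:
    """Invert expected_chain_length by bisection (it is increasing in p on [0, 1])."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected_chain_length(mid, n_tasks) < avg_len:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Average lengths quoted in the article; everything derived from them is
# an output of the idealized independence model, not a reported result.
for name, avg_len in [("ThinkAct", 4.01), ("next-best", 3.48)]:
    p = per_step_rate(avg_len)
    print(f"{name}: implied per-step rate ~ {p:.3f}, all-five rate ~ {p ** 5:.3f}")
```

Under this simplified model, the implied per-step rates are roughly 0.93 and 0.88, while the chance of finishing all five tasks is roughly 0.69 versus 0.53. A gap of about five points per step grows to about fifteen points over the full chain. These figures come from the model's assumptions and are not taken from the paper.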

The ablation studies in the paper are particularly revealing. When the researchers removed the reasoning traces and trained the same model architecture to predict actions directly from visual input and language instructions (essentially reverting to the standard VLA approach), performance dropped substantially. When they provided reasoning traces at training time but not at inference time, performance also dropped, though less severely. The best results came when reasoning was present during both training and inference, which indicates that the reasoning is more than a training regularizer: the model actively uses the generated text to inform its motor outputs at test time.

The Broader Race to Make Robots Think

ThinkAct doesn’t exist in a vacuum. The paper arrives amid a surge of interest in combining large language models with robotic control. Google DeepMind’s RT-2, published in 2023, demonstrated that a vision-language model could be fine-tuned to output robot actions, essentially treating motor commands as another “language” the model could speak. That work showed impressive zero-shot generalization — a robot that could follow instructions involving objects and concepts it had never been explicitly trained on.

But RT-2 and its successors still operate in the direct-prediction paradigm. Image in, action out. No intermediate reasoning step. The ThinkAct team positions their work as the next logical evolution: if language models benefit from thinking before answering, robot models should benefit from thinking before acting.

They’re not the only ones pursuing this idea. Recent work from MIT, Stanford, and several Chinese AI labs has explored various forms of “inner monologue” for robots, where a language model generates plans or subgoals that a lower-level controller then executes. The SayCan framework from Google, for instance, uses an LLM to propose actions and a learned affordance model to filter them based on physical feasibility. But these approaches typically treat reasoning and action as separate modules, connected by an interface. ThinkAct’s contribution is integrating them into a single model trained with a unified objective.
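The modular propose-and-filter pattern attributed to SayCan above can be sketched as follows. In the SayCan formulation, the language model's score for each candidate skill is combined multiplicatively with a learned affordance value, and the highest-scoring skill is executed. The skill names, the numbers, and the function name choose_skill below are hypothetical and chosen only to show how a skill the language model prefers can lose out when it is not physically feasible.

```python
from typing import Dict


def choose_skill(llm_scores: Dict[str, float], affordances: Dict[str, float]) -> str:
    """Return the skill with the highest (language score x affordance) product."""
    combined = {
        skill: llm_scores[skill] * affordances.get(skill, 0.0)
        for skill in llm_scores
    }
    return max(combined, key=lambda skill: combined[skill])


# Hypothetical scores: how useful the LLM thinks each skill is for the instruction.
llm_scores = {
    "pick up the sponge": 0.6,
    "go to the counter": 0.3,
    "open the drawer": 0.1,
}
# Hypothetical affordances: how likely each skill is to succeed from the current state.
affordances = {
    "pick up the sponge": 0.1,  # the sponge is out of reach
    "go to the counter": 0.9,
    "open the drawer": 0.8,
}

best = choose_skill(llm_scores, affordances)
print(best)
```

Here the language model alone would pick up the sponge, but the low affordance value pushes the choice to walking to the counter first. The two scores come from separate components, which is the modular boundary that a unified model such as ThinkAct is described as removing.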

This integration matters for a practical reason: latency. In a modular system, the language model generates a plan, the plan is parsed, the parsed plan is sent to a controller, and the controller executes. Each handoff introduces delay and potential misinterpretation. In ThinkAct, the reasoning and the action come from the same model in a single decoding sequence. The reasoning trace is short (a few dozen tokens), and the motor commands follow it directly, with no intermediate parsing or translation.

The industry implications are significant. Companies building general-purpose robots — Figure AI, 1X Technologies, Agility Robotics, and others — are all grappling with the generalization problem. A robot that works perfectly in a controlled lab setting but fails in a customer’s warehouse isn’t a product. It’s a demo. If reasoning-augmented architectures like ThinkAct can deliver meaningful improvements in real-world robustness, they could accelerate the timeline for commercially viable general-purpose manipulation.

There are caveats. The CALVIN benchmark, while useful, is a simulated environment with relatively simple objects and tasks. The real-world experiments in the paper are limited in scope — a single robot arm performing tabletop manipulation tasks. Scaling this approach to more complex environments, longer time horizons, and different robot morphologies remains an open challenge. The computational cost of generating reasoning tokens at inference time, while modest for current tasks, could become a bottleneck in time-critical applications.

And there’s a deeper question the paper raises but doesn’t fully answer: are the reasoning traces actually “reasoning” in any meaningful sense, or are they functioning as a structured intermediate representation that happens to be expressed in natural language? The authors show examples of generated traces that are coherent and task-relevant — “The gripper is open and positioned above the red block. I need to lower the gripper to grasp it.” But whether the model is genuinely reasoning about physics and spatial relationships, or simply generating plausible-sounding text that correlates with good actions, is an open philosophical and empirical question.

For practitioners, though, the philosophical question may be beside the point. What matters is that it works. And the evidence presented in the paper suggests that it does — better than alternatives that skip the thinking step.

So where does this leave the field? The trajectory seems clear. Vision-language-action models are becoming the dominant paradigm for robot learning. Adding structured reasoning to these models improves their performance and generalization. The next steps will likely involve scaling up the reasoning — longer chains of thought for more complex tasks, integration with memory systems for multi-step planning over minutes or hours rather than seconds, and training on vastly larger datasets of robot experience paired with reasoning annotations.

The challenge of generating those annotations at scale is nontrivial. The ThinkAct team used a combination of human-written reasoning traces and LLM-generated ones, then filtered for quality. That approach works for research. For industry-scale deployment, automated reasoning annotation — perhaps using a powerful LLM to watch robot demonstrations and generate post-hoc explanations of what the robot was doing and why — will be essential.

One more thing worth watching: the relationship between reasoning quality and action quality. The paper shows that better reasoning leads to better actions. But it also hints at a feedback loop — training the model to produce good actions can improve the quality of its reasoning, even without explicit supervision on the reasoning traces. If this finding holds up at scale, it suggests that robot reasoning could improve as a byproduct of action-focused training, reducing the need for expensive reasoning annotations.

The ThinkAct paper, authored by Liyuan Wang, Aiguo Chen, Jiwen Lu, and colleagues, is available as a preprint on arXiv. The code and model weights have not yet been released, though the authors indicate they plan to do so. For robotics engineers and researchers working on manipulation, the paper is essential reading — not because it solves the generalization problem, but because it offers a compelling and well-validated approach to a piece of it that has been conspicuously missing.

Robots that think before they act. It sounds obvious in retrospect. The hard part was making it work.
