Mind-Blowing! MIT Researcher Builds a Computer Inside a Transformer: Do LLMs Still Need External Tools?

Illustration of AI and computing concept

LLMs can win gold medals at the International Mathematical Olympiad (IMO), yet they stumble over elementary school arithmetic.

This paradox has long plagued the entire AI community.

Now, someone has proposed a radically new solution—not by attaching another external tool, but by building a computer directly inside the Transformer.

Even the legendary "AK" (Andrej Karpathy) exclaimed: "This is awesome!"

Comparison of AI capabilities

The Fatal Flaw of Large Models

Current state-of-the-art language models deliver impressive performance in mathematical reasoning—GPT-class systems can reach IMO gold medal standards and tackle open-ended scientific challenges.

However, one stubborn weakness persists: pure computational tasks.

They make mistakes on basic addition. They cannot solve simple Sudoku puzzles without external aid. Benchmarks such as Sudoku-Bench show that large models solving puzzles unaided succeed only rarely.

The two prevailing workarounds are:

Tool Use: The model writes code, an external interpreter executes it, and the result is fed back. This works, but the execution itself happens outside the model.

Agent Orchestration: Using external loops to save intermediate states, decompose tasks, and repeatedly call the model. Essentially, this wraps a state machine around the outside of the model.
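A minimal sketch of the tool-use pattern described above, with a stub standing in for the LLM and Python's `eval` standing in for the external interpreter (all names here are hypothetical, for illustration only):

```python
# Illustrative sketch of the "tool use" workaround: the model emits code,
# an EXTERNAL interpreter runs it, and the result is fed back. Execution
# happens entirely outside the model. All names here are hypothetical.

def stub_model(prompt: str) -> str:
    # Stand-in for an LLM: asked for code, it emits a Python expression;
    # given a tool result, it simply reads the answer back out.
    if "RESULT" not in prompt:
        return "3 + 5"
    return prompt.split("RESULT: ")[-1]

def answer_with_tool(prompt: str) -> str:
    code = stub_model(prompt)                        # 1. model writes code
    result = eval(code)                              # 2. external interpreter executes it
    return stub_model(f"{prompt}\nRESULT: {result}") # 3. result fed back to the model

print(answer_with_tool("What is 3 + 5?"))  # -> 8
```

The point of the sketch is the control handoff in step 2: the arithmetic itself never touches the model's own forward pass, which is exactly what the internal-execution approach below removes.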

An analogy clarifies the essence of the problem: Humans cannot fly, and building airplanes didn't change that fact; it merely created a machine that flies for us.

Today's large models face the exact same dilemma with computational tasks—they can describe algorithms and coordinate tools to run them, but they cannot execute the algorithms themselves. A system that cannot compute cannot truly understand what computation is.


Building a Computer Inside the Transformer

Christos Tzamos, a researcher who earned his PhD at MIT, and his team chose to tackle this head-on.

Diagram of internal computer architecture

Their core solution: Implement a modern RAM computer inside the Transformer and compile arbitrary C code into a sequence of tokens that the model can execute directly.

Specifically, they implemented a WebAssembly interpreter within the Transformer's weights. WebAssembly is a low-level instruction set that languages like C/C++ can compile directly into. Each instruction maps to at most 5 tokens.

The process of executing 3+5 works like this: The model generates a sequence of WebAssembly instructions, then switches to a fast decoding mode, executing the program token-by-token within the same Transformer, outputting the complete execution trace:

03 00 00 00  commit(+1,sts=1,bt=0)
05 00 00 00  commit(+1,sts=1,bt=0)
08 00 00 00  commit(-1,sts=1,bt=0)
out(08)
halt

Stack growth, addition triggering, result output, and machine halting—all completed within the model's own output stream, with absolutely no external calls.
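For readers unfamiliar with stack machines, here is a minimal Python sketch of the semantics behind the 3+5 trace above. It mirrors WebAssembly's stack discipline in spirit only; the actual system encodes each instruction as tokens and executes it inside the Transformer's weights, not in Python:

```python
# A minimal stack-machine sketch of the 3+5 trace. Each operation below
# corresponds to one line of the execution trace shown in the article.

def run(program):
    stack, output = [], []
    for op, *args in program:
        if op == "const":      # push an immediate (trace: 03 ..., 05 ...)
            stack.append(args[0])
        elif op == "add":      # pop two operands, push their sum (trace: 08 ...)
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "out":      # emit top of stack (trace: out(08))
            output.append(stack[-1])
        elif op == "halt":     # stop the machine (trace: halt)
            break
    return output

print(run([("const", 3), ("const", 5), ("add",), ("out",), ("halt",)]))  # [8]
```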

Tool use is opaque: The model hands over control and receives a black-box answer. Internal execution is transparent: Every intermediate step appears in the trace, and the model never leaves its own decoding loop.


Sudoku: Even the Hardest Puzzles Are Solved

Sudoku serves as another stress test for long-chain precise computation.

Neural network methods perform well on simple or random Sudoku puzzles but collapse when facing difficult ones. The usual explanation is that autoregressive models submit answers token-by-token and cannot correct early errors, making them naturally unsuitable for constraint satisfaction problems.

This work offers a different answer: The problem isn't the autoregressive paradigm itself, but that solving hard puzzles requires extremely long execution traces, and the standard attention mechanism makes generating long contexts prohibitively expensive.
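To see why hard puzzles imply long execution traces, consider a plain backtracking solver with a step counter. This is illustrative only; the paper's system runs a compiled solver inside the Transformer, not this code. Every trial below would become instructions in the trace, so trace length tracks puzzle difficulty:

```python
# A plain backtracking Sudoku solver with a step counter, to illustrate why
# hard puzzles imply long execution traces. Illustrative only.

def solve(grid, steps=None):
    # grid: 81-length list in row-major order, 0 = empty cell.
    if steps is None:
        steps = [0]
    try:
        i = grid.index(0)
    except ValueError:
        return True, steps[0]          # no empty cell left: solved
    r, c = divmod(i, 9)
    for v in range(1, 10):
        steps[0] += 1                  # each trial = more trace to generate
        ok = all(
            grid[r * 9 + k] != v and grid[k * 9 + c] != v and
            grid[(r // 3 * 3 + k // 3) * 9 + (c // 3 * 3 + k % 3)] != v
            for k in range(9)
        )
        if ok:
            grid[i] = v
            if solve(grid, steps)[0]:
                return True, steps[0]
            grid[i] = 0                # backtrack
    return False, steps[0]

# A classic easy puzzle (the well-known Wikipedia example):
puzzle = [int(ch) for ch in
          "530070000600195000098000060800060003400803001700020006060000280000419005000080079"]
solved, trials = solve(puzzle)
print(solved, trials)
```

On an easy grid like this one the trial count stays small; on adversarially hard puzzles it explodes, and with it the length of the execution trace the model must generate.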

Their system executed a compiled Sudoku solver inside the Transformer and achieved 100% accuracy, including on Arto Inkala's famously difficult puzzle (often billed as the world's hardest Sudoku), which it answered correctly within 3 minutes.

Sudoku solving animation

The correctness is universal: As long as the compiled solver itself is correct, the Transformer's execution result is correct. There are no learned heuristic guesses, and no gap between a "model-suggested answer" and an "external system-verified answer."


Core Technical Breakthrough: Exponentially Faster Attention Mechanism

For this scheme to be viable in practice, a deeper engineering obstacle had to be overcome.

Transformers acting as executors have a structural flaw: standard autoregressive decoding must interact with an ever-growing history at every step. A real computer updates a compact state, at roughly constant cost per instruction; a Transformer generating the t-th token must still attend over a prefix of length t. KV caching avoids recomputation, but scanning the cache still costs time linear in the sequence length.

The result: The computation per step grows linearly with trace length, making the total cost of generating t tokens quadratic. This is the classic Transformer bottleneck.
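Spelled out, the quadratic total is just the sum of the linear per-step costs over a trace of length T: Σ_{t=1}^{T} Θ(t) = Θ(T(T+1)/2) = Θ(T²).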

The research team's breakthrough lies in discovering that, in structured scenarios like execution traces, the Transformer's attention mechanism can take a completely different decoding path.

The key constraint: Limit the dimension of attention heads to 2.

This limitation causes a qualitative shift.

In 2D, attention queries can be reformulated in geometric terms: The key vectors of all historical tokens form a set of points on a plane. Each query is equivalent to performing a maximum inner product search on this set—i.e., finding the farthest point on the convex hull in a given direction. This is a classic problem in computational geometry, solvable with data structures having logarithmic time complexity.

Thus, the linear scan in standard decoding (scoring each key one by one) is replaced by a convex hull query (maintaining a geometric data structure where each retrieval only accesses a tiny fraction of points).

The effect: per-step decoding cost drops from Θ(t) to O(log t).
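The geometric fact being exploited can be checked directly: for a 2D query, the maximum inner product over all keys is always attained at a vertex of the keys' convex hull. The stdlib sketch below verifies this with a brute-force comparison; the real system would additionally maintain the hull incrementally and answer each query in O(log t), which is omitted here:

```python
# In 2D, argmax over keys of <query, key> is attained on the convex hull of
# the keys, so a hull query can replace the linear scan over the KV cache.
# This sketch only checks the fact; incremental hull maintenance and the
# O(log t) query structure are omitted.
import random

def convex_hull(pts):
    # Andrew's monotone chain; returns hull vertices in counter-clockwise order.
    pts = sorted(set(pts))
    if len(pts) <= 2:
        return pts
    def half(points):
        h = []
        for p in points:
            while len(h) >= 2 and (
                (h[-1][0] - h[-2][0]) * (p[1] - h[-2][1]) -
                (h[-1][1] - h[-2][1]) * (p[0] - h[-2][0])) <= 0:
                h.pop()
            h.append(p)
        return h[:-1]
    return half(pts) + half(pts[::-1])

random.seed(0)
keys = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(1000)]
hull = convex_hull(keys)
query = (0.3, -0.9)

dot = lambda q, k: q[0] * k[0] + q[1] * k[1]
best_all = max(keys, key=lambda k: dot(query, k))    # linear scan: cost grows with t
best_hull = max(hull, key=lambda k: dot(query, k))   # hull vertices only: far fewer points
print(len(keys), len(hull))                          # hull is a tiny fraction of the keys
```

For random points the hull holds only a small fraction of the keys, and a logarithmic-time search over the hull (rather than the `max` above) yields the claimed O(log t) per step.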

Empirically, per-step latency for HullKVCache stays nearly flat as sequence length grows, while standard KVCache latency climbs steadily: the two growth curves diverge dramatically.

Performance comparison graph

The entire system achieves a throughput of over 30,000 tokens per second on CPU, sufficient to sustain program execution for millions of steps.


Is 2D Enough?

Isn't this constraint too strong?

The research team answers: For Turing completeness, 2D attention is sufficient, and they proved this in their paper.

The model itself is a completely standard PyTorch Transformer, with no customized attention kernels and no sparse masks. It uses d_model=36, n_heads=18 (exactly 2 dimensions per head), and 7 layers. The only unusual thing about it is the weights themselves.

In general, the model can still have any number of layers, any number of heads, and any embedding dimension; the 2D constraint applies only within each individual head, and what it gives up can be compensated by using more heads.

For softmax attention, approximation schemes are also feasible: retrieve the top-k keys and apply softmax only over them, giving a decoding cost of O(k + log n). The same logic extends to 3D heads (via 3D convex hulls), though efficiency degrades quickly in higher dimensions.
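A sketch of that top-k softmax approximation, with the top-k retrieval done by brute force here (the O(k + log n) cost in the paper comes from retrieving those keys geometrically instead). The example plants one dominant key so the attention mass concentrates, which is the regime where the approximation is near-exact:

```python
# Sketch of approximate softmax attention: score only the top-k keys and
# renormalize over those. Top-k here is brute force, for illustration.
import math
import random

def softmax_attend(q, keys, vals, top_k=None):
    scores = [q[0] * k[0] + q[1] * k[1] for k in keys]
    idx = list(range(len(keys)))
    if top_k is not None:
        idx = sorted(idx, key=lambda i: scores[i], reverse=True)[:top_k]
    m = max(scores[i] for i in idx)                 # subtract max for stability
    w = {i: math.exp(scores[i] - m) for i in idx}
    z = sum(w.values())
    dims = len(vals[0])
    return [sum(w[i] * vals[i][d] for i in idx) / z for d in range(dims)]

random.seed(1)
keys = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(500)]
vals = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(500)]
keys.append((5.0, 0.0))   # one dominant key: attention mass concentrates on it
vals.append((0.5, -0.5))
q = (12.0, 0.0)

exact = softmax_attend(q, keys, vals)
approx = softmax_attend(q, keys, vals, top_k=8)
```

When the softmax is peaked on a few keys, the top-k output is near-exact; the approximation degrades when attention mass is spread broadly across many keys.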


What's Next?

This work opens up not just a direction for model optimization, but a new interface between software and neural networks.

Hybrid Systems: Let the language model handle planning and reasoning, while the internal execution engine runs algorithms. The boundary between the two is not an external API call, but different paths within the same forward propagation process. Since the execution trace is part of the forward pass, the entire process is differentiable—gradients can propagate through the computation itself, which is fundamentally different from external tools.

Programs Compiled into Weights: The current prototype learns an interpreter within the weights. However, the compilation mechanism built by the team can go further—arbitrary programs can be compiled directly into Transformer weights without needing to be represented as token sequences. This means the weights themselves can become the deployment target for software.

Training Beyond Gradient Descent: If logic can be compiled into weights, gradient descent is no longer the only way to modify models. Weight compilation offers another path to directly inject structure, algorithms, and reliability guarantees into networks.

AI Systems Growing Like Software Libraries: The modern software ecosystem evolves by accumulating modules, abstractions, and reusable components. A similar process could occur inside AI systems—new computational capabilities added incrementally to the model's internal execution engine.


The research team's ultimate vision is this: Future AI systems will not just use software, but contain it—integrating learned representations and compiled algorithms into the same computational substrate. In that world, software itself becomes part of the model.

For detailed information, please visit:

https://www.percepta.ai/blog/can-llms-be-computers



AINews · AI News Aggregation Platform
© 2026 AINews. All rights reserved.