GPT-4 has a 128K-token context window. Claude has 200K. These are often described as "thinking space" — the larger the window, the more the model can reason about. A popular line of recent research goes further: chain-of-thought prompting, where the model generates intermediate reasoning tokens, has been shown to make transformers Turing complete. The implication is that today's language models, with enough chain of thought, can compute anything computable.

This is wrong. Here's why, and what actually makes the difference.

The finite automaton argument

A transformer with a fixed context window has a fixed number of possible states. For a window of C tokens over a vocabulary of size V, there are at most V^C distinguishable configurations. That's a large number — but it's a number. It's finite.

A system with a finite number of states that transitions deterministically from one state to the next is, by definition, a finite automaton. It doesn't matter how clever the transition function is. It doesn't matter that a transformer is a universal function approximator within its context. By the pigeonhole principle, after at most V^C steps it must revisit a state it's seen before. And because the transitions are deterministic, once it cycles, it cycles forever. No new computation happens.

Chain of thought doesn't change this. CoT tokens are written into the context window, consuming space within the same finite window. The computation is richer — the model explores more of its state space before cycling — but the state space itself doesn't grow. A fixed window with chain of thought is still a finite automaton. Just a busier one.
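The pigeonhole argument can be made concrete in a few lines. The sketch below uses a toy stand-in for the model: a sliding window of C tokens over a vocabulary of size V, with a made-up deterministic transition rule playing the role of the transformer. Appending a new "reasoning" token evicts the oldest one, and the system revisits a state within V^C steps:

```python
from collections import deque

# Toy illustration of the pigeonhole argument. The window holds C tokens
# from a vocabulary of size V, so there are at most V**C distinct states.
V, C = 4, 4

def cot_step(window):
    # Hypothetical deterministic "next token" rule, standing in for the model.
    return sum(window) % V

window = deque([1, 2, 3, 0], maxlen=C)
seen = set()
while tuple(window) not in seen:
    seen.add(tuple(window))
    window.append(cot_step(window))   # maxlen evicts the oldest token: no growth

# The loop terminated, i.e. a state repeated, within the finite bound.
assert 0 < len(seen) <= V ** C
```

Any deterministic rule in place of cot_step gives the same outcome; only the length of the path before the cycle changes.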

This contradicts the "CoT is Turing complete" narrative. What went wrong?

What the papers actually prove

Feng et al. (2023) proved that a constant-depth transformer generating intermediate tokens can simulate any Turing machine. Merrill & Sabharwal (2024) showed that transformers with polynomial-length CoT can solve all problems in P. These are genuine results. But they share an assumption: the chain of thought is allowed to grow without bound as the input grows.

This means the context window must scale with the problem. A problem of size n gets a window of size poly(n). That's not how any deployed system works. GPT-4's window is 128K whether you give it a 10-word prompt or a 100K-word document. The window is a constant, not a function of the input.

Li & Wang (2025) pinned down the precise result: the computational class of a transformer with window size W is exactly SPACE(W). For fixed W, that is a constant-space class. Powerful — but bounded.

What actually makes it Turing complete

The answer is almost embarrassingly simple: an external read/write store.

Give a transformer access to a file system, a database, a persistent key-value store — anything it can read from, write to, and that persists between steps — and the composition is a Turing machine. The context window serves as the finite state control. The external store serves as the tape. The model reads from the store into its window, computes, writes back to the store, and moves on. This is literally the definition of a Turing machine: finite control plus infinite tape.
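A minimal sketch of the composition, under illustrative assumptions: the finite transition table below is a hypothetical two-state machine (binary increment, least-significant bit first) standing in for the model's bounded controller, and a plain dict plays the external store. The controller reads and writes one address per step; the table never grows, but the store can:

```python
store = {}          # external read/write store: address -> symbol (the "tape")
head, state = 0, "carry"

# Finite control: (state, symbol) -> (new state, symbol to write, head move).
# This hypothetical machine increments a binary number stored LSB-first.
table = {
    ("carry", 0):    ("done",  1,  0),
    ("carry", 1):    ("carry", 0, +1),   # propagate the carry rightward
    ("carry", None): ("done",  1,  0),   # ran off the end: allocate a new bit
}

def run(bits):
    global head, state
    store.clear()
    for i, b in enumerate(bits):
        store[i] = b
    head, state = 0, "carry"
    while state != "done":
        symbol = store.get(head)              # random-access read
        state, write, move = table[(state, symbol)]
        store[head] = write                   # random-access write (persists)
        head += move
    return [store[i] for i in range(max(store) + 1)]

# 3 is [1, 1] LSB-first; 3 + 1 = 4 is [0, 0, 1] LSB-first.
assert run([1, 1]) == [0, 0, 1]
```

The point of the sketch is the division of labor: everything unbounded lives in the store, while the controller stays a fixed, finite table — exactly the shape of the simulation argument.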

We proved this formally in Lean 4 — a step-for-step simulation showing that any Turing machine with S states is perfectly replicated by a controller of size S operating on an external store. The proof has zero unresolved axioms.

The key requirements for the store:

  1. Random-access read: the model can request data from any address
  2. Random-access write: the model can write data to any address
  3. Persistence: writes survive between steps

A file system qualifies. A database qualifies. A code interpreter with persistent variables qualifies. Chain of thought within a fixed context window does not qualify: tokens that scroll out of the window are gone, so nothing persists beyond the window's bound.
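The three requirements can be written down as an interface. The names below (ExternalStore, DictStore, read, write) are illustrative, not from the paper; any backend satisfying the contract, whether a file system, a database, or a key-value server, can serve as the tape:

```python
from typing import Optional, Protocol

class ExternalStore(Protocol):
    """Hypothetical minimal interface capturing the three requirements."""
    def read(self, address: str) -> Optional[str]: ...     # 1. random-access read
    def write(self, address: str, value: str) -> None: ... # 2. random-access write
    # 3. Persistence is a property of the backend: a write made at one
    #    step must remain visible to reads at every later step.

class DictStore:
    """In-process reference implementation. It persists for the lifetime
    of the process; a real deployment would back this with files or a DB."""
    def __init__(self) -> None:
        self._data: dict = {}
    def read(self, address: str) -> Optional[str]:
        return self._data.get(address)
    def write(self, address: str, value: str) -> None:
        self._data[address] = value

store = DictStore()
store.write("scratch:0", "partial result")
assert store.read("scratch:0") == "partial result"   # the write survived the step
```

A fixed context window fails the contract at point 3: once a token scrolls out, no later read can recover it.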

Why this matters

For architecture. External tools aren't a convenience — they're a computational necessity. The difference between an LLM with and without an external store isn't "faster" versus "slower." It's "finite automaton" versus "Turing machine." A qualitative jump in what's computable, not a quantitative improvement in speed.

For scaling. Expanding the context window from 128K to 1M tokens makes the finite automaton larger, but it's still a finite automaton. It can handle bigger problems within its bound, but it cannot handle problems that exceed that bound no matter how long it runs. An 8K-token model with a well-organized external store is strictly more powerful than a 1M-token model without one.

For organization. Once you have the store, a second question arises: does it matter how the store is organized? The answer is yes, dramatically. We prove (in a companion result, the Library Theorem) that an indexed store yields O(log N) retrieval versus O(N) for an unstructured store — an exponential gap. Over T reasoning steps the store itself grows as the model writes, so N is on the order of T, and the totals compound to O(T log T) versus Θ(T²). The more complex the problem, the more organization matters.
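The retrieval gap is easy to see by counting comparisons. The sketch below looks up the worst-case key in a store of N sorted entries, once by linear scan (unstructured) and once by binary search over the index; the counts are comparisons, not wall-clock time:

```python
# Indexed vs. unstructured retrieval, measured in comparisons.
N = 1 << 16
keys = list(range(N))     # the "indexed" store: keys kept sorted
target = N - 1            # worst case for a linear scan

# Unstructured store: scan until the key is found -> N comparisons here.
linear_comparisons = next(i for i, k in enumerate(keys) if k == target) + 1

# Indexed store: binary search -> about log2(N) comparisons.
lo, hi, log_comparisons = 0, N, 0
while lo < hi:
    mid = (lo + hi) // 2
    log_comparisons += 1
    if keys[mid] < target:
        lo = mid + 1
    else:
        hi = mid

assert keys[lo] == target
assert linear_comparisons == N        # 65,536 comparisons
assert log_comparisons <= 17          # ~log2(65,536) comparisons
```

Repeated over every reasoning step, the per-lookup gap is what compounds into the quadratic-versus-near-linear totals above.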

Externalization is what makes you capable. Organization is what makes you efficient. And the advantage of both grows without bound.

What this means in practice

When you give Claude access to a computer tool, or let GPT-4 write and execute code, or connect an LLM to a database — you are not adding a convenience feature. You are upgrading a finite automaton to a Turing machine. The model goes from "can solve any problem that fits in its window" to "can solve any computable problem, period."

And when that external store is organized — indexed, structured, navigable — the model doesn't just gain capability. It gains efficiency that scales with problem complexity. A flat text file of notes helps; a well-indexed store cuts total retrieval cost from quadratic to near-linear.

This is, incidentally, why libraries were invented. Not as a convenience for forgetful scholars, but as a computational necessity for thinking beyond what fits in one head. The math is the same. The medium is different.


All results are mechanically verified in Lean 4 with zero unresolved axioms. Technical details in "External Memory, Not Chain of Thought: What Makes Bounded Transformers Turing Complete" (Mainen 2026).