A language model knows things. It knows that Paris is the capital of France, that water boils at 100°C, that Shakespeare wrote Hamlet. This knowledge lives in the model's weights — billions of floating-point numbers, arranged in matrices, completely unreadable to any human.

What if you could take that knowledge out?

Not by asking the model questions one at a time. By actually decomposing the weight matrices into two parts: a reduced set of weights that handles computation, and an indexed external store that holds the factual knowledge in a form you can read, search, edit, and audit. The model would still work. But its knowledge would be outside, organized, and yours to inspect.

This is what redaction is.

The idea

Every weight in a neural network responds to some inputs and ignores others. A weight in an early attention layer might activate for tokens that look like dates. A weight deeper in the network might activate for sentences about geography. Most weights are surprisingly sparse: they care about a small fraction of possible inputs.

If a weight only matters for a narrow set of inputs, you don't need it in the weight matrix. You can replace it with a lookup: when the model encounters that input pattern, retrieve the weight's contribution from an external store. The remaining weight matrix is smaller, handles the dense computations, and the sparse knowledge sits in an indexed library.
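The lookup scheme can be sketched in a few lines. This is a toy sketch, not the paper's implementation: it assumes a plain real-valued weight matrix and a precomputed per-weight dependency density, and the names `decompose` and `apply_with_store` are illustrative.

```python
import numpy as np

def decompose(W, density, threshold=0.01):
    """Split W into a dense core and an indexed store of sparse weights.

    `density` is an array the same shape as W giving each weight's
    (hypothetically precomputed) dependency density. Weights below the
    threshold are zeroed in the core and moved to a lookup table keyed
    by their position.
    """
    sparse_mask = density < threshold
    core = np.where(sparse_mask, 0.0, W)           # dense computation stays
    store = {tuple(idx): W[tuple(idx)]             # sparse knowledge moves out
             for idx in np.argwhere(sparse_mask)}
    return core, store

def apply_with_store(core, store, x):
    """Forward pass: dense matmul plus retrieved sparse contributions."""
    y = core @ x
    for (i, j), w in store.items():                # look up externalized weights
        y[i] += w * x[j]
    return y
```

A real system would retrieve only the store entries relevant to the current input pattern rather than iterating over all of them; the point here is just that the dense core plus the store reproduces the original matrix exactly.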

This is the Library Theorem applied reflexively — to the model's own weights.

What the math says

We define a weight's dependency density: the fraction of inputs for which that weight materially affects the output. If a weight has low dependency density, it's sparse — it encodes specific knowledge, not general computation.
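One plausible way to estimate this quantity is Monte Carlo ablation: run the model with and without the weight over a sample of inputs and count how often the output moves by more than the tolerance. This is one reading of the definition, not necessarily the paper's measurement procedure.

```python
import numpy as np

def dependency_density(f, f_ablated, inputs, eps=1e-4):
    """Estimate a weight's dependency density: the fraction of sampled
    inputs on which ablating the weight moves the output by more than eps.

    `f` and `f_ablated` are the model's forward pass with and without the
    weight; both are hypothetical stand-ins for a real model call.
    """
    hits = sum(np.max(np.abs(f(x) - f_ablated(x))) > eps for x in inputs)
    return hits / len(inputs)
```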

The theorem: under reasonable sparsity conditions, at least (1 − ε) of all weights can be externalized into an indexed store. The store size scales with the sparsity structure, not the number of weights. The reduced weight matrix retains the dense, general-purpose computation.

What the experiment showed

We ran this analysis on GPT-2. The results were sharper than expected.

About 65% of weights have dependency density below 0.01 at a tolerance of ε = 10⁻⁴. These are the specific-knowledge weights — the ones encoding facts, associations, and patterns that fire rarely but matter when they do. They're candidates for externalization.

The distribution is bimodal: weights are either very sparse or very dense. Almost nothing sits in the middle. This isn't an artifact of the analysis — it reflects the architecture. Layer norms are universally dense (dependency density ~0.77). Token embeddings are universally sparse (~97% qualify for externalization). Attention and MLP layers show the bimodal split.
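Tabulating the bimodal split is straightforward once per-weight densities are in hand. The cutoffs below are illustrative choices, and the densities in any real run would come from the measurement above, not from synthetic data:

```python
import numpy as np

def split_fractions(densities, sparse_cut=0.01, dense_cut=0.5):
    """Fraction of weights in the sparse, middle, and dense regimes.

    Bimodality shows up as almost all mass landing in the first and
    last buckets. Cutoff values are illustrative, not the paper's.
    """
    d = np.asarray(densities)
    sparse = np.mean(d < sparse_cut)
    dense = np.mean(d >= dense_cut)
    return float(sparse), float(1.0 - sparse - dense), float(dense)
```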

There's a phase transition at ε = 10⁻⁴. Below that threshold, the decomposition is clean. Above it, the boundary between sparse and dense blurs. The model has a natural seam between "things it knows" and "things it does."
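A phase transition like this would show up in a sweep over the tolerance: plot the sparse-weight fraction against ε and look for the knee. A sketch, assuming per-weight ablation gaps have been precomputed (the array shape and cutoff are assumptions for illustration):

```python
import numpy as np

def sparse_fraction_vs_eps(ablation_gaps, eps_grid, cut=0.01):
    """For each tolerance eps, the fraction of weights whose dependency
    density (fraction of inputs with ablation gap > eps) falls below `cut`.

    `ablation_gaps` is a hypothetical (n_weights, n_inputs) array of
    |f(x) - f_ablated(x)| gaps, assumed precomputed.
    """
    gaps = np.asarray(ablation_gaps)
    fractions = []
    for eps in eps_grid:
        density = np.mean(gaps > eps, axis=1)    # per-weight density at this eps
        fractions.append(float(np.mean(density < cut)))
    return fractions
```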

What this enables

Once knowledge is externalized, three things become possible that aren't possible with weights alone:

Auditability. You can inspect what the model knows. Not by asking it — models confabulate — but by reading the store directly. Every fact has a location, a provenance, and a context in which it was learned.

Selective intervention. You can edit specific knowledge without retraining. Remove a piece of misinformation. Update an outdated fact. Add private data that shouldn't be in the weights. The surgery is precise because the knowledge is indexed, not diffused across a billion parameters.

Provenance. When the model produces an output, you can trace which stored facts contributed. Not perfectly — the dense computation layer still mediates — but with far more transparency than black-box attribution.

The model's knowledge is not a black box. It is a library that hasn't been organized yet.

The connection

This paper is where the arc bends back on itself. Paper 7 showed that external memory makes a bounded transformer Turing complete. Paper 1 showed that organized external memory is exponentially more efficient. Paper R applies both results to the model's own internals: the weights already contain an implicit library. Redaction makes it explicit.

The forward pass of a neural network is a situation in the ontological sense (Paper 0): an entity (the input) belongs to a context (the computation) with qualities (the intermediate activations). The weights that fire are the participants. The outputs are the traces. Everything the ontology describes about external reasoning applies, reflexively, to the model's own reasoning.


Preliminary experiment on GPT-2 confirms bimodal sparsity prediction. Full externalization experiment in progress. Theory formalized in the Library Theorem framework. Details in "Redaction: Externalizing Knowledge from Neural Network Weights" (Mainen 2026).