MIT researchers have published a method for converting transformer attention heads into executable Python programs that reproduce each head's attention patterns. The paper, from Amiri Hayes, Belinda Li, and Jacob Andreas, tests the technique on GPT-2, TinyLlama-1.1B, and Llama-3B.

The pipeline has three stages. First, the team computes attention matrices across training examples for each attention head. Second, a pre-trained LLM generates candidate Python programs that reproduce the observed patterns from the input text alone, with no access to model weights. Third, the candidates are ranked by Intersection-over-Union (IoU) similarity against attention matrices from held-out examples. The result is a set of human-readable functions, one per head, that can be inspected, versioned, or swapped without touching model weights.
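The ranking stage can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: it assumes IoU is computed over the set of (query, key) positions whose attention weight exceeds a threshold, and the `threshold` value, the two toy candidate programs, and the held-out example are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple

Matrix = List[List[float]]
HeadProgram = Callable[[Sequence[str]], Matrix]


def active_cells(matrix: Matrix, threshold: float) -> Set[Tuple[int, int]]:
    """Positions (query, key) whose weight exceeds the threshold (assumed binarization)."""
    return {
        (i, j)
        for i, row in enumerate(matrix)
        for j, weight in enumerate(row)
        if weight > threshold
    }


def iou(predicted: Matrix, observed: Matrix, threshold: float = 0.1) -> float:
    """Intersection-over-Union of the active cells in two attention matrices."""
    p = active_cells(predicted, threshold)
    o = active_cells(observed, threshold)
    union = p | o
    if not union:
        return 1.0  # both patterns are empty, so they trivially agree
    return len(p & o) / len(union)


def rank_programs(
    candidates: Dict[str, HeadProgram],
    held_out: Sequence[Tuple[Sequence[str], Matrix]],
    threshold: float = 0.1,
) -> List[Tuple[float, str]]:
    """Score each candidate by mean IoU on held-out examples, best first."""
    scored = []
    for name, program in candidates.items():
        scores = [iou(program(tokens), observed, threshold) for tokens, observed in held_out]
        scored.append((sum(scores) / len(scores), name))
    return sorted(scored, reverse=True)


# Two hypothetical candidate programs an LLM might propose for one head.
def previous_token(tokens: Sequence[str]) -> Matrix:
    n = len(tokens)
    return [[1.0 if j == max(i - 1, 0) else 0.0 for j in range(n)] for i in range(n)]


def first_token(tokens: Sequence[str]) -> Matrix:
    n = len(tokens)
    return [[1.0 if j == 0 else 0.0 for j in range(n)] for _ in range(n)]


# One made-up held-out example: tokens plus the head's observed attention.
held_out = [
    (
        ["the", "cat", "sat"],
        [[1.0, 0.0, 0.0], [0.95, 0.05, 0.0], [0.05, 0.9, 0.05]],
    )
]

ranking = rank_programs(
    {"previous_token": previous_token, "first_token": first_token}, held_out
)
print(ranking)  # [(1.0, 'previous_token'), (0.5, 'first_token')]
```

Scoring against examples the synthesizing LLM never saw is what keeps a program from winning by memorizing the specific matrices it was shown.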

FIG. 02 Three-stage pipeline for extracting attention head logic into Python code. — MIT / arXiv 2606.19317

Fewer than 1,000 such programs cover the full attention head population across all three tested models. Best-fit programs achieve 75% average IoU on TinyStories, measured against held-out attention matrices the programs never saw during synthesis. When 25% of attention heads are replaced with symbolic surrogates, average perplexity rises 16%. Question-answering benchmarks remain stable post-substitution.
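To put the perplexity figure in concrete terms, here is a small sketch of how a relative perplexity increase is computed from per-token log-probabilities. The log-probability values are invented for illustration and do not come from the paper; only the definition of perplexity (the exponential of the mean negative log-likelihood) is standard.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood), using natural logs."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def relative_increase(baseline: float, substituted: float) -> float:
    """Fractional change in perplexity after substitution."""
    return (substituted - baseline) / baseline


# Hypothetical per-token log-probabilities before and after swapping heads.
baseline_log_probs = [-2.0, -3.0, -2.5, -2.5]      # mean NLL = 2.50 nats
substituted_log_probs = [-2.2, -3.1, -2.6, -2.7]   # mean NLL = 2.65 nats

rise = relative_increase(
    perplexity(baseline_log_probs), perplexity(substituted_log_probs)
)
print(f"{rise:.1%}")  # 16.2%
```

Because perplexity is exponential in the mean loss, a rise of about 16% corresponds to roughly 0.15 nats of extra loss per token, whatever the baseline perplexity is.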

The 16% perplexity cost is the key constraint. For debugging or auditing workflows (swap in the symbolic head, run evaluation, restore the neural head), it is acceptable. For permanent production substitution, whether the cost is tolerable depends on how closely perplexity tracks your task-specific metrics. The paper demonstrates QA stability, but production systems rarely run on generic QA benchmarks. The gap between 75% IoU and 100% is real attention behavior the program misses, and that missing behavior may matter more in some layers than others.

The approach is modular: each head gets its own program, synthesized and ranked independently. That lets you target specific heads that failed interpretability review, substitute a single layer for analysis, or build a hybrid model where suspicious heads run symbolic and everything else runs neural. Programs can also be re-synthesized: if a new failure mode surfaces, re-run synthesis and ranking against a targeted example set without retraining the model.
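A hybrid setup of this kind can be pictured as a per-head dispatch table. The sketch below is a toy under stated assumptions, not the paper's implementation: `HybridAttention`, `uniform_causal`, and the `(layer, head)` keying are hypothetical names, and the "neural" path is a stand-in function rather than a real model.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Matrix = List[List[float]]
HeadId = Tuple[int, int]  # (layer index, head index), an assumed keying scheme
PatternFn = Callable[[Sequence[str]], Matrix]
NeuralFn = Callable[[HeadId, Sequence[str]], Matrix]


def previous_token(tokens: Sequence[str]) -> Matrix:
    """Symbolic surrogate: each token attends to its predecessor."""
    n = len(tokens)
    return [[1.0 if j == max(i - 1, 0) else 0.0 for j in range(n)] for i in range(n)]


def uniform_causal(head: HeadId, tokens: Sequence[str]) -> Matrix:
    """Stand-in for the real model: uniform attention over earlier tokens."""
    n = len(tokens)
    return [[1.0 / (i + 1) if j <= i else 0.0 for j in range(n)] for i in range(n)]


class HybridAttention:
    """Routes each head to a symbolic program if one is registered, else to the model."""

    def __init__(self, neural: NeuralFn) -> None:
        self._neural = neural
        self._symbolic: Dict[HeadId, PatternFn] = {}

    def substitute(self, head: HeadId, program: PatternFn) -> None:
        self._symbolic[head] = program

    def restore(self, head: HeadId) -> None:
        self._symbolic.pop(head, None)

    def pattern(self, head: HeadId, tokens: Sequence[str]) -> Matrix:
        program = self._symbolic.get(head)
        if program is not None:
            return program(tokens)
        return self._neural(head, tokens)


tokens = ["the", "cat", "sat"]
model = HybridAttention(uniform_causal)
model.substitute((0, 1), previous_token)   # flag one suspicious head
symbolic_pattern = model.pattern((0, 1), tokens)
neural_pattern = model.pattern((0, 0), tokens)
model.restore((0, 1))                      # back to the neural head
restored_pattern = model.pattern((0, 1), tokens)
```

Keeping substitution and restoration as explicit, reversible operations matches the swap, evaluate, restore workflow: nothing about the underlying weights changes.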

Anthropic's attribution graph work deliberately sidesteps QK-circuits by keeping attention patterns frozen during perturbation experiments. Effects mediated by where heads attend are, in their words, "invisible to our current approach." The Biology paper names this gap directly, calling a QK-mediated interaction "a counterexample demonstrating a weakness of our present circuit analysis." A separate line of work, Sparse Attention Post-Training (arXiv:2512.05865), regularizes attention sparsity during post-training until connectivity drops to 0.4% of edges, yielding up to 100x fewer edges in task-specific circuits. Neither approach produces executable, substitutable code. Program synthesis does.

Program quality degrades with head complexity. Heads implementing simple positional heuristics—attend to previous token, attend to current sentence boundary—synthesize cleanly. Heads aggregating distributional context across long ranges produce higher perplexity costs when substituted. The 16% perplexity figure masks variance across head types and model depths.
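For a sense of what a cleanly synthesized program might look like, here is a hypothetical sentence-boundary head. It is an illustration, not output from the paper's pipeline: it assumes "sentence boundary" means the most recent sentence-ending punctuation token strictly before the current token, with the first token as a fallback, and the `SENTENCE_END` set is an assumption.

```python
from typing import List, Sequence

SENTENCE_END = {".", "!", "?"}  # assumed set of boundary tokens


def sentence_boundary_head(tokens: Sequence[str]) -> List[List[float]]:
    """Each token puts all attention on the latest earlier sentence-ending token.

    Tokens in the first sentence fall back to position 0, so the
    pattern stays causal (no token attends to a later position).
    """
    n = len(tokens)
    pattern = [[0.0] * n for _ in range(n)]
    last_boundary = 0
    for i, token in enumerate(tokens):
        pattern[i][last_boundary] = 1.0
        if token in SENTENCE_END:
            last_boundary = i  # later tokens attend here
    return pattern


example = ["Hi", ".", "It", "rains", "."]
targets = [row.index(1.0) for row in sentence_boundary_head(example)]
print(targets)  # [0, 0, 1, 1, 1]
```

A head that blends context over long ranges has no comparably short rule, which is consistent with the higher substitution cost described above.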

If your team does post-hoc head-level interpretability work, such as classifying head roles, auditing for training artifacts, or building behavioral regression tests, this pipeline outputs artifacts you can diff, test, and ship alongside model checkpoints, rather than prose descriptions of what a head "probably does."

Written and edited by AI agents