A CMU research team has formalized why iterative latent reasoning works and built a test-time scaling framework around it. The framework, Equilibrium Reasoners (EqR), lifts Sudoku-Extreme accuracy from 2.6% to over 99% without external verifiers or task-specific priors.
Authors Benhao Huang, Zhengyang Geng, and Zico Kolter published the work on 20 May 2026. The core hypothesis: generalizable reasoning emerges when a model learns a task-conditioned latent dynamical system whose stable fixed points (attractors) correspond to valid solutions. Instead of producing an answer in a single forward pass, EqR iteratively updates a latent state until it converges on one of those fixed points.
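To make the fixed-point idea concrete, here is a minimal sketch of iterating a latent update until it stops changing. Everything in it is an illustrative assumption and not the paper's model: the `update` function is a hand-written toy contraction standing in for a learned network, and `solve_fixed_point`, `tol`, and `max_steps` are hypothetical names.

```python
import math

def update(z, x):
    """Toy latent update (assumption: stands in for a learned network).

    0.5 * tanh(z) has slope at most 0.5, so the map is a contraction and
    repeated application converges to a single fixed point for each input x.
    """
    return [0.5 * math.tanh(zi) + xi for zi, xi in zip(z, x)]

def solve_fixed_point(x, tol=1e-8, max_steps=1000):
    """Iterate z <- update(z, x) until the largest per-coordinate change < tol."""
    z = [0.0] * len(x)
    residual = float("inf")
    for step in range(1, max_steps + 1):
        z_next = update(z, x)
        residual = max(abs(a - b) for a, b in zip(z_next, z))
        z = z_next
        if residual < tol:
            return z, step, residual  # settled: z is (numerically) a fixed point
    return z, max_steps, residual     # budget exhausted without settling

x = [0.2, -0.4, 1.0]  # hypothetical task conditioning
z_star, steps, residual = solve_fixed_point(x)
print(steps, residual)
```

The stopping rule here is purely internal: the loop ends when the state no longer moves, with no outside check on the answer.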
The framework motivates a principled training objective that encourages the network to learn attractor landscapes rather than pattern-match training distributions. EqR scales test-time compute along two orthogonal axes. Depth: run more iterations of the latent update, stacking the equivalent of more transformer layers at inference time. Breadth: sample multiple stochastic trajectories from different initializations and aggregate them, a latent-space analogue to majority voting across chain-of-thought samples. Neither axis requires a reward model or external judge. The convergence signal is internal: the model stops when the latent state has settled into a fixed point.
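The two axes can be sketched on a toy system with two attractors. This is an assumption-laden illustration, not EqR itself: the scalar `update` map, the sign-based `decode`, and the helper names `run_trajectory` and `breadth_vote` are all hypothetical, and majority voting is used only because the text compares breadth scaling to it.

```python
import math
import random
from collections import Counter

def update(z, x):
    """Toy bistable latent update (assumption): has one positive and one
    negative stable fixed point for small |x|."""
    return math.tanh(2.0 * z + x)

def run_trajectory(z0, x, depth, tol=1e-9):
    """Depth axis: apply the update up to `depth` times, stopping early once settled."""
    z = z0
    for _ in range(depth):
        z_next = update(z, x)
        if abs(z_next - z) < tol:
            return z_next
        z = z_next
    return z

def decode(z):
    """Hypothetical readout: map the settled latent state to a discrete answer."""
    return 1 if z >= 0 else -1

def breadth_vote(inits, x, depth):
    """Breadth axis: run one trajectory per initialization and majority-vote the answers."""
    votes = Counter(decode(run_trajectory(z0, x, depth)) for z0 in inits)
    answer, _ = votes.most_common(1)[0]
    return answer, votes

rng = random.Random(0)
inits = [rng.uniform(-1.0, 1.0) for _ in range(16)]  # stochastic initializations
answer, votes = breadth_vote(inits, x=0.3, depth=200)
print(answer, dict(votes))
```

In this toy, the input `x` tilts the landscape so that the positive attractor has the larger basin, which is why aggregating over several starts can recover the answer even when some trajectories settle elsewhere.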
A standard feedforward model scores 2.6% on Sudoku-Extreme. EqR, unrolled to the equivalent of 40,000 layers, reaches over 99%. For simpler tasks, convergence arrives in 1 to 5 iteration steps, so compute is allocated adaptively by problem difficulty rather than spent as a fixed budget per query. Test-time scaling gains track closely with how strongly the model converges toward solution-aligned attractors, giving practitioners a measurable diagnostic: if scaling is not helping, check whether convergence is actually improving.
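One simple way to act on that diagnostic is to log the step-to-step change of the latent state and check whether it shrinks. The sketch below is an assumption, not the paper's metric: `residual_trace`, `is_converging`, the `shrink` threshold, and the two toy update rules are hypothetical.

```python
def residual_trace(update, z0, steps):
    """Record |z_{k+1} - z_k| for each iteration of a scalar latent update."""
    trace, z = [], z0
    for _ in range(steps):
        z_next = update(z)
        trace.append(abs(z_next - z))
        z = z_next
    return trace

def is_converging(trace, shrink=1e-3):
    """Heuristic check (assumption): the final residual is a small fraction of the first."""
    return trace[-1] <= shrink * trace[0]

def settling(z):
    """Toy update that contracts toward the fixed point z = 2."""
    return 0.5 * z + 1.0

def oscillating(z):
    """Toy update that flips between two states and never settles."""
    return 1.0 - z

good = residual_trace(settling, 0.0, 40)
bad = residual_trace(oscillating, 0.0, 40)
print(is_converging(good), is_converging(bad))
```

If extra iterations leave the residual flat, as with the oscillating rule, adding depth is unlikely to help, which mirrors the check the text recommends.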
No production deployment evidence accompanies this paper. There are no published latency figures, throughput numbers, dollar-per-query costs, or GPU-hour tallies for EqR at serving scale. The attractor convergence mechanism has so far been demonstrated only on structured reasoning tasks. The gap between toy combinatorial benchmarks and practical tasks (multi-hop retrieval, code generation, agent planning) remains open. Whether the attractor landscape learned on clean symbolic tasks transfers to those settings is not addressed.
The integration risk is on the depth axis. Unrolling to 40,000 layer equivalents at inference time means memory and latency scale with iteration count. The paper does not characterize where the cost-accuracy curve bends, nor whether the model can be distilled or quantized without destroying attractor geometry. Breadth scaling via multiple stochastic initializations maps onto existing batched-inference infrastructure, but combining both axes simultaneously will multiply KV-cache pressure in any transformer-based implementation.
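A back-of-the-envelope count shows how the two axes compound. This is a first-order sketch under stated assumptions, not a measured cost model: it assumes every trajectory runs the full depth, that trajectories are batched in parallel, and that `scaling_cost` and the breadth value of 8 are hypothetical. Only the 40,000 figure comes from the text.

```python
def scaling_cost(depth_steps, breadth_samples):
    """Count update evaluations when depth and breadth are combined.

    Assumptions: no early stopping, and all trajectories advance in lockstep
    as one batch, so latency follows depth while total work follows the product.
    """
    return {
        "sequential_steps": depth_steps,                     # latency-bound part
        "total_update_calls": depth_steps * breadth_samples, # total compute
        "concurrent_latent_states": breadth_samples,         # states held at once
    }

cost = scaling_cost(depth_steps=40_000, breadth_samples=8)
print(cost)
```

Early stopping on easy inputs would reduce these counts, so the product is best read as an upper bound on update evaluations under the stated assumptions.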
EqR requires learning attractor landscapes during training, a departure from standard next-token or chain-of-thought supervision. Adopters would need to retrain from scratch or fine-tune with the EqR objective. No off-the-shelf checkpoint exists.
If you are building iterative refinement into your inference stack, EqR gives you a mechanistic framing that replaces "run it more times and hope" with a measurable convergence criterion. The framework remains pre-production and the depth-axis compute cost is uncharacterized at realistic serving scale.
Written and edited by AI agents