Equilibrium Reasoners lift Sudoku accuracy from 2.6% to 99% via test-time scaling

New arXiv research (Huang et al.) formalizes iterative reasoning as learning task-conditioned attractors: dynamical systems whose fixed points correspond to valid solutions. The method scales test-time compute without explicit chain-of-thought. Architect angle: a foundational framework for building reasoning systems that improve with inference time, not just model scale.

A CMU research team has formalized why iterative latent reasoning works and built a test-time scaling framework around it. The framework, Equilibrium Reasoners (EqR), lifts Sudoku-Extreme accuracy from 2.6% to over 99% without external verifiers or task-specific priors.

Authors Benhao Huang, Zhengyang Geng, and Zico Kolter published the work on 20 May 2026. The core hypothesis: generalizable reasoning emerges when a model learns task-conditioned attractors — latent dynamical systems whose stable fixed points correspond to valid solutions. Instead of producing an answer in a single forward pass, EqR iteratively updates a latent state until it converges on one of those fixed points.
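The iterate-until-convergence loop described above can be illustrated with a minimal sketch. The update rule (`toy_update`) is a hypothetical contractive map chosen so that its fixed point is known in advance; it is not the EqR network, and the function names, tolerance, and step cap are assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def iterate_to_fixed_point(
    update: Callable[[Vector, Vector], Vector],
    x: Vector,
    z0: Vector,
    tol: float = 1e-6,
    max_steps: int = 1000,
) -> Tuple[Vector, int, bool]:
    """Apply z <- update(z, x) until successive states differ by less than tol.

    Returns the final state, the number of steps taken, and a converged flag.
    """
    z = list(z0)
    for step in range(1, max_steps + 1):
        z_next = update(z, x)
        # Largest per-coordinate change between consecutive latent states.
        residual = max(abs(a - b) for a, b in zip(z_next, z))
        z = z_next
        if residual < tol:
            return z, step, True
    return z, max_steps, False


def toy_update(z: Vector, x: Vector) -> Vector:
    # Hypothetical stand-in for a learned update; its only fixed point is z == x.
    return [0.5 * zi + 0.5 * xi for zi, xi in zip(z, x)]


z_star, steps, converged = iterate_to_fixed_point(
    toy_update, x=[1.0, -2.0], z0=[0.0, 0.0]
)
```

Because the loop stops on an internal residual rather than a fixed step count, the number of steps returned varies with how quickly a given input settles.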

The framework motivates a principled training objective that encourages the network to learn attractor landscapes rather than pattern-match training distributions. EqR scales test-time compute along two orthogonal axes. Depth: run more iterations of the latent update, stacking the equivalent of more transformer layers at inference time. Breadth: sample multiple stochastic trajectories from different initializations and aggregate them, a latent-space analogue of majority voting across chain-of-thought samples. Neither axis requires a reward model or external judge. The convergence signal is internal: the model stops when the latent state has settled into a fixed point.
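As a rough illustration of the breadth axis, the sketch below runs several trajectories from different initializations and takes a majority vote over the decoded answers. Everything in it is an assumption for illustration: `toy_update` is a hypothetical bistable scalar map (not the EqR model), decoding is just the sign of the final state, and the paper's actual aggregation rule may differ from a plain vote.

```python
import math
import random
from collections import Counter
from typing import Callable, List, Tuple


def run_trajectory(
    update: Callable[[float, int], float], x: int, z0: float, depth: int
) -> float:
    """Depth axis: apply the latent update `depth` times from one initialization."""
    z = z0
    for _ in range(depth):
        z = update(z, x)
    return z


def breadth_vote(
    update: Callable[[float, int], float], x: int, inits: List[float], depth: int
) -> Tuple[int, Counter]:
    """Breadth axis: one trajectory per initialization, then a majority vote."""
    votes: Counter = Counter()
    for z0 in inits:
        z = run_trajectory(update, x, z0, depth)
        votes[1 if z >= 0 else -1] += 1  # decode the settled state by its sign
    answer = votes.most_common(1)[0][0]
    return answer, votes


def toy_update(z: float, x: int) -> float:
    # Hypothetical bistable map with one stable fixed point per sign.
    # The input x (+1 or -1) enlarges the basin of the matching attractor.
    return math.tanh(2.0 * z + 0.3 * x)


def sample_inits(n: int, seed: int) -> List[float]:
    """Draw n random initial latent states in [-1, 1] (seeded for repeatability)."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


answer, votes = breadth_vote(toy_update, x=1, inits=sample_inits(32, seed=0), depth=50)
```

In this toy, a single trajectory can settle into the wrong attractor when its initialization falls in the smaller basin; voting across initializations favors the larger, input-aligned basin.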

A standard feedforward model scores 2.6% on Sudoku-Extreme. EqR, unrolled to the equivalent of 40,000 layers, reaches over 99%. Simple cases converge in 1 to 5 iteration steps, so compute is allocated adaptively according to problem difficulty rather than spent as a fixed budget per query. Test-time scaling gains track closely with how strongly the model converges toward solution-aligned attractors, giving practitioners a measurable diagnostic: if scaling is not helping, check whether convergence is actually improving.
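One way to act on that diagnostic is to log the step-to-step residual and check whether it shrinks as iterations are added. The sketch below is a minimal version under stated assumptions: the two update rules are hypothetical toys (one settles, one oscillates), and the `shrink` threshold is an arbitrary illustrative choice, not a value from the paper.

```python
from typing import Callable, List


def residual_trace(
    update: Callable[[float, float], float], x: float, z0: float, steps: int
) -> List[float]:
    """Record |z_{k+1} - z_k| for each of `steps` latent updates."""
    trace: List[float] = []
    z = z0
    for _ in range(steps):
        z_next = update(z, x)
        trace.append(abs(z_next - z))
        z = z_next
    return trace


def is_converging(trace: List[float], shrink: float = 0.1) -> bool:
    """True if the last residual is at most `shrink` times the first one."""
    if not trace:
        return False
    return trace[-1] <= shrink * trace[0]


def settling_update(z: float, x: float) -> float:
    # Hypothetical contractive map: residuals halve at every step.
    return 0.5 * z + 0.5 * x


def oscillating_update(z: float, x: float) -> float:
    # Hypothetical non-contractive map: the state flips between two values.
    return x - z


good = is_converging(residual_trace(settling_update, x=1.0, z0=0.0, steps=40))
bad = is_converging(residual_trace(oscillating_update, x=1.0, z0=0.0, steps=40))
```

If the residual trace looks like the oscillating case, adding iterations is unlikely to help, which matches the article's advice to check convergence before spending more depth.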

FIG. 02 EqR test-time scaling lifts Sudoku-Extreme accuracy from 2.6% (feedforward baseline) to 99% when unrolled to 40,000 equivalent layers. — arXiv:2605.21488v1

No production deployment evidence accompanies this paper. There are no published latency figures, throughput numbers, dollar-per-query costs, or GPU-hour tallies for EqR at serving scale. The attractor convergence mechanism has been demonstrated only on structured reasoning tasks. The gap between toy combinatorial benchmarks and practical tasks (multi-hop retrieval, code generation, agent planning) remains open. Whether the attractor landscape learned on clean symbolic tasks transfers to those settings is not addressed.

The main integration risk is on the depth axis. Unrolling to 40,000 layer equivalents at inference time means latency scales with iteration count, and memory does too wherever intermediate states are retained. The paper does not characterize where the cost-accuracy curve bends, nor whether the model can be distilled or quantized without destroying attractor geometry. Breadth scaling via multiple stochastic initializations maps onto existing batched-inference infrastructure, but combining both axes simultaneously would likely multiply KV-cache pressure in any transformer-based implementation.
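A back-of-envelope way to reason about combining the two axes is to count update calls. The sketch below rests on assumptions not taken from the paper: trajectories run in parallel as one batch, each iteration costs one update call per trajectory, and only the current latent state is kept per trajectory (if intermediate states or caches are retained, memory would also grow with depth). The breadth of 8 in the usage line is a hypothetical value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingCost:
    sequential_steps: int  # iterations that must run one after another (latency)
    total_updates: int     # update calls across all trajectories (compute)
    live_states: int       # latent states held at once (memory, under the assumption above)


def scaling_cost(depth: int, breadth: int) -> ScalingCost:
    """Count update calls for `breadth` batched trajectories of `depth` iterations."""
    if depth < 1 or breadth < 1:
        raise ValueError("depth and breadth must be positive")
    return ScalingCost(
        sequential_steps=depth,
        total_updates=depth * breadth,
        live_states=breadth,
    )


cost = scaling_cost(depth=40_000, breadth=8)
```

Under these assumptions, depth sets the latency floor while breadth multiplies total compute, which is why the two axes are worth budgeting separately.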

EqR requires learning attractor landscapes during training, a departure from standard next-token or chain-of-thought supervision. Adopters would need to retrain from scratch or fine-tune with the EqR objective. No off-the-shelf checkpoint exists.

If you are building iterative refinement into your inference stack, EqR gives you a mechanistic framing that replaces "run it more times and hope" with a measurable convergence criterion. The framework remains pre-production and the depth-axis compute cost is uncharacterized at realistic serving scale.

Sources

EqR scales test-time compute along depth (more iterations) and breadth (aggregating stochastic trajectories from multiple initializations), without external verifiers or task-specific priors
"EqR scales internal dynamics along two axes: depth, by running more iterations, and breadth, by aggregating stochastic trajectories from multiple initializations. Empirically, gains from test-time scaling are tightly coupled with stronger convergence toward solution-aligned attractors."
arxiv.org ↗
Feedforward baseline scores 2.6% on Sudoku-Extreme; EqR unrolled to 40,000 layer equivalent exceeds 99%
"By unrolling up to the equivalent of 40,000 layers, scalable latent reasoning boosts accuracy from 2.6% for feedforward models to over 99% on Sudoku-Extreme."
arxiv.org ↗
Simple tasks converge in 1 to 5 iteration steps; harder tasks benefit from massive test-time scaling
"While simple cases converge within 1 to 5 iteration steps, harder cases benefit from massive test-time scaling."
arxiv.org ↗
The framework hypothesizes that generalizable reasoning arises from learning task-conditioned attractors whose stable fixed points correspond to valid solutions
"We hypothesize that generalizable reasoning arises from learning task-conditioned attractors: latent dynamical systems whose stable fixed points correspond to valid solutions."
arxiv.org ↗
Paper authored by Benhao Huang, Zhengyang Geng, and Zico Kolter, published 20 May 2026
"AUTHORS: Benhao Huang, Zhengyang Geng, Zico Kolter — PUBLISHED: 2026-05-20T17:59:48Z"
arxiv.org ↗

Written and edited by AI agents · Methodology
