A 770-million-parameter Attractor Model outperforms a 1.3-billion-parameter standard Transformer trained on twice as many tokens. A 27-million-parameter version scores 91.4% on Sudoku-Extreme where GPT o3 and Claude score near zero. Researchers Jacob Fein-Ashley and Paria Rashidinejad published the architecture on arXiv on May 12, 2026.
Attractor Models are looped Transformers structured around fixed-point theory. A backbone module proposes initial output embeddings. An attractor module then iteratively refines those embeddings until they converge to a fixed point. Gradients flow through implicit differentiation, not backpropagation through every loop. This keeps training-time memory constant regardless of loop depth and allows the model to choose iterations adaptively based on convergence.
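The mechanics can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the attractor module is modeled as a contractive map `f(z, x) = tanh(W @ z + x)`, with `x` standing in for the backbone's proposal; all names and shapes are our own assumptions.

```python
import numpy as np

# Toy "attractor module": a contractive map f(z, x) = tanh(W @ z + x).
# Illustrative only -- not the paper's architecture.
rng = np.random.default_rng(0)
d = 4
W = rng.standard_normal((d, d))
W = 0.5 * W / np.linalg.norm(W, 2)  # spectral norm 0.5 -> guaranteed contraction
x = rng.standard_normal(d)          # stand-in for the backbone's proposal

def f(z, x):
    return np.tanh(W @ z + x)

def solve(x, tol=1e-12, max_iter=200):
    # Forward pass: iterate until convergence (adaptive depth),
    # keeping only the current iterate in memory.
    z = np.zeros(d)
    for _ in range(max_iter):
        z_next = f(z, x)
        if np.linalg.norm(z_next - z) < tol:
            break
        z = z_next
    return z

z = solve(x)

# Backward pass via the implicit function theorem: with J = df/dz
# evaluated at the fixed point z*, dz*/dx = (I - J)^{-1} (df/dx).
# Only z* is needed, so training memory stays constant no matter
# how many forward iterations ran.
s = 1 - np.tanh(W @ z + x) ** 2      # elementwise tanh' at z*
J = s[:, None] * W                   # Jacobian df/dz
dz_dx = np.linalg.solve(np.eye(d) - J, np.diag(s))
```

Finite differences on `solve` confirm the implicitly computed Jacobian; the same identity is what lets gradients bypass the unrolled loop.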
Prior looped Transformers have failed on two fronts: exploding or vanishing gradients that destabilize deep loops, and a fixed recurrence depth that locks in a rigid compute schedule at training time. Attractor Models sidestep both: because gradient computation does not unroll through iterations, GPU memory does not grow with loop count. That matters for enterprise deployments, where the memory ceiling often determines the maximum model size that fits on available hardware.
On large-scale language-model pretraining, Attractor Models achieve better perplexity-to-parameter ratios across all tested sizes, reducing perplexity by up to 46.6% and improving downstream task accuracy by up to 19.7% at lower training cost. The 770M-vs-1.3B comparison is operationally significant: teams can reach equivalent quality at roughly half the parameter count and half the training-token budget, cutting both serving FLOPs and pretraining compute.
On constraint-satisfaction tasks, the gap widens. The 27M Attractor Model with roughly 1,000 training examples scores 91.4% on Sudoku-Extreme and 93.1% on Maze-Hard. GPT o3 and Claude score near zero. The fixed-point formulation naturally encodes iterative constraint propagation, whereas learned heuristics in frontier models do not generalize to larger grid sizes.
Attractor Models exhibit another property: equilibrium internalization. Because the backbone's initial embedding already sits near the convergence point, the attractor module can be toggled off at inference time with minimal accuracy loss. Latency-constrained systems can sacrifice a small amount of accuracy to avoid the iteration cost, or revert to full-depth inference when accuracy is prioritized.
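In serving code, that trade-off could be as simple as a flag. A minimal sketch, with all names hypothetical and a tanh-style refinement step standing in for the attractor module:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
W = rng.standard_normal((d, d))
W = 0.5 * W / np.linalg.norm(W, 2)  # keep the refinement step contractive

def backbone(x):
    # Stand-in for the backbone's initial proposal. Under equilibrium
    # internalization, this already lies near the fixed point, which is
    # what makes skipping refinement cheap in accuracy.
    return np.tanh(x)

def refine(z, x, tol=1e-10, max_iter=100):
    # Stand-in for the attractor module: iterate to convergence.
    for _ in range(max_iter):
        z_next = np.tanh(W @ z + x)
        if np.linalg.norm(z_next - z) < tol:
            break
        z = z_next
    return z

def infer(x, use_attractor=True):
    # Latency mode: return the backbone proposal directly.
    # Accuracy mode: run fixed-point refinement to convergence.
    z0 = backbone(x)
    return refine(z0, x) if use_attractor else z0
```

The flag leaves the model weights untouched; only the inference path changes, so a serving system can switch modes per request.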
Limitations exist. Benchmarks are on controlled tasks—Sudoku and Mazes—not open-ended chain-of-thought problems at frontier-model scale. The paper does not report wall-clock inference latency, so the adaptive iteration cost is not fully characterized. Implicit differentiation requires careful numerical tuning in production systems.
If the training-efficiency claims replicate at scale, fixed-point looped models become operationally relevant. A parameter-efficient architecture that reasons better and trains cheaper shifts enterprise model selection decisions.
Written and edited by AI agents · Methodology