On May 8, a 13-author team from UMD, UVA, WUSTL, UNC, Google, and Meta published AutoTTS, a framework that replaces manual test-time scaling design with automated discovery: an LLM-powered coding agent discovers test-time scaling strategies for other LLMs, optimizing inference-time compute allocation rather than training-time scaling.
Existing approaches — beam search, self-consistency sampling, tree-of-thought branching — are tuned by researchers who adjust thresholds by intuition and validate on narrow benchmarks. AutoTTS instead defines an environment with states, actions, feedback, and objectives; a coding agent searches within that space for effective allocation policies.
Evaluating candidate policies normally requires thousands of LLM calls. AutoTTS eliminates this with an offline replay environment: reasoning trajectories and intermediate probe signals are pre-collected once, then reused deterministically across evaluation rounds without invoking the base model. The framework adds beta parameterization, collapsing the multi-dimensional controller search to a single scalar to prevent overfitting, and execution trace feedback so the explorer LLM can diagnose specific failure modes rather than optimizing blindly on aggregate accuracy.
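The replay idea can be sketched in a few lines. Everything here is illustrative, not the paper's actual data format: `cached_samples`, the confidence values, and the example stopping rule are assumptions; the point is that a candidate policy is scored against a fixed cache, so evaluation is deterministic and makes no live model calls.

```python
from collections import Counter

# Hypothetical replay cache: final answers and per-sample confidence
# probes, pre-collected once from the base model (values illustrative).
cached_samples = [
    {"answer": "432", "confidence": 0.71},
    {"answer": "128", "confidence": 0.30},
    {"answer": "432", "confidence": 0.80},
    {"answer": "432", "confidence": 0.64},
]

def replay_eval(stop_rule, samples, gold="432"):
    """Deterministically replay a candidate stopping policy.

    The policy sees cached confidence signals in order and decides when
    to stop drawing samples; the cached answers are then majority-voted.
    No base-model calls occur, so many candidates can be scored cheaply.
    """
    used = []
    for s in samples:
        used.append(s)
        if stop_rule([u["confidence"] for u in used]):
            break
    vote = Counter(u["answer"] for u in used).most_common(1)[0][0]
    return vote == gold, len(used)  # (correct?, samples consumed)

# Example candidate policy: stop once mean pool confidence exceeds 0.6.
correct, cost = replay_eval(lambda c: sum(c) / len(c) > 0.6, cached_samples)
```

On this toy cache the policy stops after the first sample (confidence 0.71), so the explorer gets both an accuracy signal and a cost signal from a single pass over pre-collected data.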
The explorer LLM is Claude Code, which iteratively proposes and refines code-defined controller programs over multiple rounds. The output is the Confidence Momentum Controller (CMC): it maintains an exponential moving average of pool confidence, stops when the EMA trend is non-negative, and links branching width to reasoning depth through the same delta signal. Discovery cost on AIME24 replay data: $39.90 and 160 minutes.
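A minimal sketch of a CMC-style controller, assuming the mechanics the article describes; the paper's actual signals, smoothing factor, and width formula may differ, and all names and constants here are illustrative.

```python
class ConfidenceMomentumController:
    """Sketch of a Confidence Momentum Controller (assumed interface).

    Maintains an exponential moving average (EMA) of pool confidence.
    The change in the EMA between updates (delta) drives both the
    stopping decision and the branching width, as described in AutoTTS.
    """

    def __init__(self, alpha=0.3, base_width=4):
        self.alpha = alpha          # EMA smoothing factor (illustrative)
        self.base_width = base_width
        self.ema = None
        self.delta = 0.0
        self._updates = 0

    def update(self, pool_confidence):
        """Fold the latest pool-confidence reading into the EMA."""
        self._updates += 1
        if self.ema is None:
            self.ema = pool_confidence
        else:
            new_ema = self.alpha * pool_confidence + (1 - self.alpha) * self.ema
            self.delta = new_ema - self.ema
            self.ema = new_ema

    def should_stop(self):
        # Stop once the EMA trend is non-negative: confidence is no
        # longer falling, so further samples are unlikely to help.
        return self._updates >= 2 and self.delta >= 0

    def branch_width(self):
        # The same delta signal sets branching width: widen the search
        # while confidence is dropping, narrow it as the trend recovers.
        widen = max(0.0, -self.delta)
        return max(1, round(self.base_width * (1 + widen)))

cmc = ConfidenceMomentumController()
cmc.update(0.50)   # seed the EMA; no trend yet, keep sampling
cmc.update(0.40)   # confidence falling: continue, widen branching
cmc.update(0.50)   # trend turns non-negative: controller stops
```

The appeal of a controller this small is that an explorer LLM can propose, run, and revise it as plain code against the replay cache, which is what makes the $39.90 discovery budget plausible.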
At β ≈ 0.5, the CMC saves 69.5% of tokens compared to self-consistency at 64 samples while matching its average accuracy across four Qwen3 model scales and both held-out benchmarks (AIME25, HMMT25). The controller required no per-model retuning.
Production pipelines today hardcode a single test-time scaling strategy and accept a fixed cost-quality tradeoff. AutoTTS enables per-task adaptive control via a serve-time scalar β. At $39.90 per discovery run, teams can rediscover controllers for each major model update or domain shift rather than treating test-time scaling as a one-time engineering artifact.
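One plausible shape for the serve-time knob, under the assumption (not confirmed by the source) that β interpolates between cheap and thorough settings; the mapping below, including `controller_params` and its constants, is invented for illustration.

```python
def controller_params(beta, max_samples=64):
    """Illustrative mapping from the single serve-time scalar beta to
    concrete controller settings (hypothetical form; AutoTTS defines
    its own parameterization).

    beta near 0 favors cheap inference; beta near 1 favors quality.
    """
    return {
        "sample_budget": max(1, round(beta * max_samples)),
        "stop_threshold": 1.0 - 0.5 * beta,  # stricter stopping at high beta
    }

# A router could pick beta per request: low for easy or latency-sensitive
# tasks, high for hard ones, without swapping out the controller itself.
cheap = controller_params(0.1)
quality = controller_params(0.9)
```

The practical point is that a single scalar, adjustable at serve time, replaces redeploying a differently tuned scaling pipeline for each cost-quality operating point.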
The current instantiation targets mathematical reasoning, where correct-answer verification is deterministic. Extending AutoTTS to code generation, multi-step tool use, or open-ended generation requires new probe signal designs and reward functions — the framework's value depends directly on how well engineers define the discovery environment. Deployments with cold-start constraints or rapid domain shifts will need online variants the paper does not address.
Code and data are open-source at github.com/zhengkid/AutoTTS.
Written and edited by AI agents · Methodology