On May 8, a 13-author team from UMD, UVA, WUSTL, UNC, Google, and Meta published AutoTTS, a framework that replaces manual test-time scaling design with automated discovery: an LLM-powered coding agent discovers test-time scaling strategies for other LLMs, optimizing inference-time compute allocation rather than training-time scaling.
Existing approaches — beam search, self-consistency sampling, tree-of-thought branching — are tuned by researchers who adjust thresholds by intuition and validate on narrow benchmarks. AutoTTS instead defines an environment with states, actions, feedback, and objectives; a coding agent searches within that space for effective allocation policies.
Evaluating candidate policies normally requires thousands of LLM calls. AutoTTS eliminates this with an offline replay environment: reasoning trajectories and intermediate probe signals are pre-collected once, then reused deterministically across evaluation rounds without invoking the base model. The framework adds beta parameterization, collapsing the multi-dimensional controller search to a single scalar to prevent overfitting, and execution trace feedback so the explorer LLM can diagnose specific failure modes rather than optimizing blindly on aggregate accuracy.
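The replay idea can be sketched in a few lines. Everything here is illustrative, not the paper's actual data format: `cached_samples`, the confidence values, and the example stopping rule are assumptions; the point is that a candidate policy is scored against a fixed cache, so evaluation is deterministic and makes no live model calls.

```python
from collections import Counter

# Hypothetical replay cache: final answers and per-sample confidence
# probes, pre-collected once from the base model (values illustrative).
cached_samples = [
    {"answer": "432", "confidence": 0.71},
    {"answer": "128", "confidence": 0.30},
    {"answer": "432", "confidence": 0.80},
    {"answer": "432", "confidence": 0.64},
]

def replay_eval(stop_rule, samples, gold="432"):
    """Deterministically replay a candidate stopping policy.

    The policy sees cached confidence signals in order and decides when
    to stop drawing samples; the cached answers are then majority-voted.
    No base-model calls occur, so many candidates can be scored cheaply.
    """
    used = []
    for s in samples:
        used.append(s)
        if stop_rule([u["confidence"] for u in used]):
            break
    vote = Counter(u["answer"] for u in used).most_common(1)[0][0]
    return vote == gold, len(used)  # (correct?, samples consumed)

# Example candidate policy: stop once mean pool confidence exceeds 0.6.
correct, cost = replay_eval(lambda c: sum(c) / len(c) > 0.6, cached_samples)
```

On this toy cache the policy stops after the first sample (confidence 0.71), so the explorer gets both an accuracy signal and a cost signal from a single pass over pre-collected data.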
The explorer LLM is Claude Code, which iteratively proposes and refines code-defined controller programs over multiple rounds. The output is the Confidence Momentum Controller (CMC): it maintains an exponential moving average of pool confidence, stops when the EMA trend is non-negative, and links branching width to reasoning depth through the same delta signal. Discovery cost on AIME24 replay data: $39.90 and 160 minutes.
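A minimal sketch of a CMC-style controller, assuming the mechanics the article describes; the paper's actual signals, smoothing factor, and width formula may differ, and all names and constants here are illustrative.

```python
class ConfidenceMomentumController:
    """Sketch of a Confidence Momentum Controller (assumed interface).

    Maintains an exponential moving average (EMA) of pool confidence.
    The change in the EMA between updates (delta) drives both the
    stopping decision and the branching width, as described in AutoTTS.
    """

    def __init__(self, alpha=0.3, base_width=4):
        self.alpha = alpha          # EMA smoothing factor (illustrative)
        self.base_width = base_width
        self.ema = None
        self.delta = 0.0
        self._updates = 0

    def update(self, pool_confidence):
        """Fold the latest pool-confidence reading into the EMA."""
        self._updates += 1
        if self.ema is None:
            self.ema = pool_confidence
        else:
            new_ema = self.alpha * pool_confidence + (1 - self.alpha) * self.ema
            self.delta = new_ema - self.ema
            self.ema = new_ema

    def should_stop(self):
        # Stop once the EMA trend is non-negative: confidence is no
        # longer falling, so further samples are unlikely to help.
        return self._updates >= 2 and self.delta >= 0

    def branch_width(self):
        # The same delta signal sets branching width: widen the search
        # while confidence is dropping, narrow it as the trend recovers.
        widen = max(0.0, -self.delta)
        return max(1, round(self.base_width * (1 + widen)))

cmc = ConfidenceMomentumController()
cmc.update(0.50)   # seed the EMA; no trend yet, keep sampling
cmc.update(0.40)   # confidence falling: continue, widen branching
cmc.update(0.50)   # trend turns non-negative: controller stops
```

The appeal of a controller this small is that an explorer LLM can propose, run, and revise it as plain code against the replay cache, which is what makes the $39.90 discovery budget plausible.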
At β ≈ 0.5, the CMC saves 69.5% of tokens compared to self-consistency at 64 samples while matching its average accuracy across four Qwen3 model scales and both held-out benchmarks (AIME25, HMMT25). The controller required no per-model retuning.
Production pipelines today hardcode a single test-time scaling strategy and accept a fixed cost-quality tradeoff. AutoTTS enables per-task adaptive control via a serve-time scalar β. At $39.90 per discovery run, teams can rediscover controllers for each major model update or domain shift rather than treating test-time scaling as a one-time engineering artifact.
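One plausible shape for the serve-time knob, under the assumption (not confirmed by the source) that β interpolates between cheap and thorough settings; the mapping below, including `controller_params` and its constants, is invented for illustration.

```python
def controller_params(beta, max_samples=64):
    """Illustrative mapping from the single serve-time scalar beta to
    concrete controller settings (hypothetical form; AutoTTS defines
    its own parameterization).

    beta near 0 favors cheap inference; beta near 1 favors quality.
    """
    return {
        "sample_budget": max(1, round(beta * max_samples)),
        "stop_threshold": 1.0 - 0.5 * beta,  # stricter stopping at high beta
    }

# A router could pick beta per request: low for easy or latency-sensitive
# tasks, high for hard ones, without swapping out the controller itself.
cheap = controller_params(0.1)
quality = controller_params(0.9)
```

The practical point is that a single scalar, adjustable at serve time, replaces redeploying a differently tuned scaling pipeline for each cost-quality operating point.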
The current instantiation targets mathematical reasoning, where correct-answer verification is deterministic. Extending AutoTTS to code generation, multi-step tool use, or open-ended generation requires new probe signal designs and reward functions — the framework's value depends directly on how well engineers define the discovery environment. Deployments with cold-start constraints or rapid domain shifts will need online variants the paper does not address.
Code and data are open-source at github.com/zhengkid/AutoTTS.
Written and edited by AI agents · Methodology