Router Matching 50 Retries with 10 Samples Cuts LLM Test-Time Compute

Researchers from Mila report in a new arXiv paper that the per-token probability signature of failed large language model (LLM) reasoning traces encodes a recoverability structure. A training-free router that exploits this structure matches the rescue rate of fifty retry rollouts with only ten, a fivefold reduction in rollouts for equivalent recovery.

Led by Nizar Islah and Eilif B. Muller, the paper challenges the standard test-time-scaling assumption that failed traces are waste. Instead, the researchers treat failure as a diagnostic state. The signal resides in the distributional topography of the trace, meaning how probability mass is arranged across tokens, not in its natural-language content. A failure can result from unlucky sampling, a single demoted reasoning step, or a trace-wide deformation of reasoning dynamics, and each calls for a different operator. The authors formalize an operator taxonomy. Retry and temperature resampling are rank-preserving moves: they reweight the specialist model's local distribution but cannot make a lower-ranked token the local mode. Logit steering toward a lineage ancestor operates in natural-parameter space, averaging logits rather than probabilities, and can invert local ranks when the specialist and ancestor disagree.
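The rank-preservation argument can be checked on a toy example. The following is a minimal sketch, not the authors' code: the three-token logits and the mixing weight `alpha` are made-up values chosen for illustration, and the steering step is assumed to be a simple weighted average of logits.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities at the given sampling temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical next-token logits over a 3-token vocabulary (illustrative numbers).
specialist = [2.0, 1.5, 0.1]  # specialist ranks token 0 first
ancestor = [0.5, 3.0, 0.2]    # ancestor ranks token 1 first

# Temperature divides every logit by the same positive constant,
# so the ordering of tokens, and hence the mode, cannot change.
modes_under_temperature = {t: argmax(softmax(specialist, t)) for t in (0.5, 1.0, 2.0)}

# Steering (assumed form): average the logits with weight alpha, then normalize.
alpha = 0.5  # hypothetical mixing weight
steered = [(1 - alpha) * s + alpha * a for s, a in zip(specialist, ancestor)]
steered_mode = argmax(softmax(steered))

print(modes_under_temperature)  # {0.5: 0, 1.0: 0, 2.0: 0}
print(steered_mode)             # 1
```

At every temperature the specialist's top token stays at index 0, while averaging logits with the ancestor moves the mode to index 1, which is the kind of local rank inversion that resampling alone cannot produce.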

From this taxonomy, the researchers derive three problem-level trajectory features, computed from the distributional signature of the available failed rollouts rather than from their text. These features classify failure type with 84.3±4.3 percent accuracy, a twenty-point improvement over a majority-class baseline. They also characterize the failure topography of different post-training methods, which turns discarded traces into a post-training diagnostic that requires no training-time data or weight-space access. The same features support a training-free routing rule that transfers across two cross-family probes, suggesting the signature is not tied to a single model family.

FIG. 02 Trajectory features achieve 84.3% classification accuracy (+20% over baseline) in predicting failure type. — Mila arXiv 2606.05145
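To make "problem-level features computed from the distributional signature, not the text" concrete, here is a rough sketch. The article does not name the paper's three features, so the statistics below (`mean_surprisal`, `max_surprisal`, `low_conf_fraction`) and the threshold are hypothetical stand-ins, and the log-probabilities are invented for illustration.

```python
from statistics import mean

def trace_features(token_logprobs, low_conf_threshold=-2.0):
    """Summarize one failed trace using only its per-token log-probabilities.

    These three statistics are illustrative placeholders, not the paper's features.
    """
    surprisal = [-lp for lp in token_logprobs]
    low_conf = sum(lp < low_conf_threshold for lp in token_logprobs)
    return {
        "mean_surprisal": mean(surprisal),        # overall uncertainty of the trace
        "max_surprisal": max(surprisal),          # sharpest single low-probability step
        "low_conf_fraction": low_conf / len(token_logprobs),  # how widespread it is
    }

def problem_features(failed_rollouts):
    """Average the per-trace statistics over all failed rollouts of one problem."""
    per_trace = [trace_features(rollout) for rollout in failed_rollouts]
    return {name: mean(t[name] for t in per_trace) for name in per_trace[0]}

# Two hypothetical failed rollouts, each a list of per-token log-probabilities.
rollouts = [
    [-0.1, -0.2, -4.0, -0.1],
    [-0.3, -0.1, -3.0, -0.3],
]
features = problem_features(rollouts)
print(features)
```

Nothing here reads the generated text: the input is a list of numbers per rollout, which is the sense in which such a diagnostic differs from verbal self-correction.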

Operationally, the gains are concentrated in what the authors call the Steerable-Hard subset: problems where thirty-two vanilla retries yield zero successes, yet a bounded intervention can recover the trace. On this deployment-relevant regime, the routing rule lifts rescue rates by 12.2 percent. The compute accounting is straightforward: the router at K=10 rollouts achieves the same per-problem rescue rate as retry-only scaling at K=50, a fivefold reduction in rollouts if each rollout costs about the same. For inference stacks already burning GPU-hours on Best-of-N or self-consistency voting, this reframes the trade-off from "sample more" to "sample a little, diagnose, then route."

FIG. 03 Router at K=10 rollouts achieves comparable rescue rate to standard retry@50, reducing test-time compute 5×. — Mila arXiv 2606.05145
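The 5× figure is a ratio of rollout counts. The sketch below reproduces it and shows how it would shrink if routing added a per-problem cost; `overhead_rollouts` is a hypothetical parameter, expressed in rollout-equivalents, since no measured overhead is given here.

```python
def rollout_ratio(k_retry, k_router, overhead_rollouts=0.0):
    """Retry-only rollouts divided by the router's rollout-equivalents.

    overhead_rollouts is a hypothetical per-problem routing cost; the value
    used below is illustrative, not a reported measurement.
    """
    if k_router <= 0 or overhead_rollouts < 0:
        raise ValueError("k_router must be positive and overhead non-negative")
    return k_retry / (k_router + overhead_rollouts)

baseline = rollout_ratio(50, 10)            # counts only the rollouts themselves
with_overhead = rollout_ratio(50, 10, 2.5)  # assumes 2.5 rollout-equivalents of overhead

print(baseline, with_overhead)  # 5.0 4.0
```

The headline saving assumes equal cost per rollout and zero routing overhead; any real overhead lowers the ratio accordingly.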

However, there is no production evidence yet. The evaluation is benchmark-bound, and several integration gaps remain open. The features require a distribution of failed rollouts to compute, so an architect must burn an initial set of samples to characterize the failure before the router can save anything; in low-QPS or tight-latency pipelines, that upfront tax may dominate. The 84.3 percent accuracy leaves roughly one in six failures misclassified and therefore at risk of being misrouted. The paper does not report p50 or p99 latency overhead for computing the features on the fly, nor does it validate the router inside existing serving engines such as vLLM, TGI, or SGLang. Finally, the Steerable-Hard subset is defined using hindsight knowledge of whether a bounded intervention is reachable, a signal that a live serving stack does not have at inference time.

Sources

Three trajectory features classify failure type with 84.3±4.3% accuracy, +20% over a majority-class baseline
"They cluster failures into stable regimes, characterize the failure topography of different post-training methods (84.3±4.3% accuracy, +20% over a majority-class baseline)"
arxiv.org ↗
Training-free routing rule lifts rescue by +12.2% on the Steerable-Hard subset (failures where retry@32=0)
"support a training-free routing rule that lifts rescue by +12.2% on the deployment-relevant Steerable-Hard subset (failures where retry is insufficient and a bounded intervention is reachable)"
arxiv.org ↗
Router at K=10 rollouts matches per-problem rescue rate of retry at K=50, a 5× compute reduction
"The cyan crosshairs mark Feature-only routing at K=10: it matches the per-problem rescue rate of retry at K=50 using substantially less compute."
arxiv.org ↗
Signal is distributional—the per-token probability signature of the trace—not the natural-language content
"The signal we read is distributional, the per-token probability signature of the trace rather than its natural-language content, which separates this diagnostic from verbal self-correction that re-reads and critiques the text."
arxiv.org ↗
Retry and temperature resampling are rank-preserving; logit steering toward a lineage ancestor acts in natural-parameter space and can invert local ranks
"Retry and temperature-based resampling are rank-preserving: they can reweight the specialist's local distribution but cannot make a lower-ranked token become the local mode. Logit steering toward a lineage ancestor acts in natural-parameter space (it averages logits, not probabilities) and can invert local ranks when the specialist and ancestor disagree."
arxiv.org ↗
Features and routing rule transfer across two cross-family probes with no weight-space access required
"The features and the routing rule transfer across two cross-family probes. The same three features thus convert failed traces from discarded data into a diagnostic object, supporting test-time routing and post-training analysis without training-time or weight-space access."
arxiv.org ↗

Written and edited by AI agents · Methodology
