Researchers from Mila have demonstrated in a new arXiv paper that the per-token probability signature of failed large language model (LLM) reasoning traces encodes a recoverability structure. A training-free router can exploit this structure to match the rescue rate of fifty retry rollouts with only ten, a fivefold reduction in rollout budget for equivalent recovery.
Led by Nizar Islah and Eilif B. Muller, the paper challenges the standard test-time-scaling assumption that failed traces are waste. Instead, the researchers treat failure as a diagnostic state, with the signal residing in the distributional topography of the trace (how probability mass is arranged across tokens) rather than in the natural-language content. A failure can result from unlucky sampling, a single demoted reasoning step, or a trace-wide deformation of reasoning dynamics, and each calls for a different recovery operator. The authors formalize an operator taxonomy: retry and temperature resampling are rank-preserving moves that reweight existing modes but cannot invert local token rankings, whereas logit steering toward a lineage ancestor operates in natural-parameter space and can flip local ranks when the specialist and ancestor disagree.
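The distinction between rank-preserving and rank-flipping operators can be illustrated with a toy next-token distribution. The sketch below is not the paper's implementation: the logit values are made up, and the `steer` function assumes a simple linear interpolation in logit space, which is only one plausible form of steering toward an ancestor model.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities at a given (positive) temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def ranking(probs):
    """Token indices ordered from most to least probable."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

def steer(specialist_logits, ancestor_logits, alpha):
    """Hypothetical steering operator: interpolate logits toward the ancestor.

    This functional form is an assumption made for illustration only.
    """
    return [(1 - alpha) * s + alpha * a
            for s, a in zip(specialist_logits, ancestor_logits)]

# Toy logits (assumed values): the specialist demotes token 1,
# while the ancestor prefers it.
specialist = [2.0, 1.5, 0.1]
ancestor = [0.5, 2.5, 0.0]

base_rank = ranking(softmax(specialist))

# Temperature changes how peaked the distribution is, never the token order.
temp_ranks = [ranking(softmax(specialist, t)) for t in (0.5, 1.0, 2.0)]

# Steering moves the logits themselves, so the top token can change.
steered_rank = ranking(softmax(steer(specialist, ancestor, alpha=0.5)))

print(base_rank, temp_ranks, steered_rank)
```

In this toy case every temperature leaves token 0 on top, so resampling can only hope to draw the demoted token by chance, whereas the steered distribution promotes token 1 to first place.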
From this taxonomy, the researchers derive three problem-level trajectory features computed from the distributional signature of available failed rollouts, not their text. These features classify failure type with 84.3±4.3 percent accuracy, a twenty-point improvement over a majority-class baseline. They also characterize the failure topography of different post-training methods, turning discarded traces into a post-training diagnostic that requires no training-time data or weight-space access. The same features support a training-free routing rule that transfers across two cross-family probes, suggesting the signature is not tied to a single model family.
Operationally, the gains are concentrated in what the authors call the Steerable-Hard subset: problems where thirty-two vanilla retries yield zero success, yet a bounded intervention can recover the trace. On this deployment-relevant regime, the routing rule lifts rescue rates by 12.2 percent. The compute accounting is straightforward: the router at K=10 rollouts achieves the same per-problem rescue rate as retry-only scaling at K=50. For inference stacks already burning GPU-hours on Best-of-N or self-consistency voting, this reframes the trade-off from "sample more" to "sample a few, diagnose, then route."
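The rollout accounting above can be checked with a few lines of arithmetic. The sketch below uses the K=10 and K=50 figures from the article. The retry model is an assumption added for illustration: it treats rollouts as independent with a fixed per-sample success probability `p`, which the paper may not assume.

```python
def rollout_saving(router_k, retry_k):
    """Ratio of rollouts spent by retry-only scaling to rollouts spent by the router."""
    return retry_k / router_k

def retry_rescue_probability(p, k):
    """Chance that at least one of k independent retries succeeds.

    Assumes each retry succeeds independently with probability p
    (an illustrative model, not taken from the paper).
    """
    return 1 - (1 - p) ** k

# Figures quoted in the article: router at K=10 matches retry-only at K=50.
saving = rollout_saving(router_k=10, retry_k=50)

# If vanilla sampling essentially never succeeds, more retries buy nothing.
hard_case = retry_rescue_probability(p=0.0, k=50)

# With a small but nonzero success probability, returns grow slowly with k.
easy_case = retry_rescue_probability(p=0.01, k=50)

print(saving, hard_case, round(easy_case, 3))
```

Under this simple model, a problem whose per-sample success probability is effectively zero stays unrescued however large K grows, which is the regime where a rank-changing intervention rather than another retry is needed.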
However, there is no production evidence yet. The evaluation is benchmark-bound, and several integration gaps remain open. The features require a distribution of failed rollouts to compute, so an architect must burn an initial set of samples to characterize the failure before the router can save anything; in low-QPS or tight-latency pipelines, that upfront tax may dominate. The 84.3 percent accuracy leaves roughly one in six failures misrouted, and the paper does not report p50 or p99 latency overhead for computing the features on the fly, nor does it validate the router inside existing serving engines such as vLLM, TGI, or SGLang. Finally, the Steerable-Hard subset is defined using hindsight knowledge of whether a bounded intervention is reachable, a signal a live serving stack does not have at inference time.
Written and edited by AI agents