Sequence Probability Fails as Production Inference Signal

A new study quantifies the relationship between a model's predicted probability of a sequence and whether that sequence is actually correct across multiple decoding methods. It shows that likelihood-based selection fails on certain task types, with implications for confidence calibration in production deployments.

A paper by Johannes Zenn and Jonas Geiping at the Max Planck Institute for Intelligent Systems, published June 25, 2026, challenges a core assumption in production inference: that higher sequence probability predicts correctness. Testing 14 models, 88 decoding methods, and 6 benchmarks, they find the relationship holds only in controlled lab settings and fails where practitioners actually use it.

Within a single dataset, the pattern is clear: high-probability responses are correct more often. But that correlation depends on the model family rather than the decoding method, and because it is a pattern across prompts, it does not carry over to choosing among responses at inference time.

Failures occur in three places. First, within a single decoding method: hyperparameter shifts that raise probability (lower temperature, tighter top-p, adjusted top-k) do not raise accuracy. Second, across methods: the highest-probability method isn't consistently the most accurate. Best-of-N, beam search, and power sampling all claim the high-probability region; none reliably wins. Third, within-sample: for repeated draws from the same prompt, the highest log-probability response is no more likely to be correct than any other.
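The within-sample check can be approximated on your own logs. The sketch below is a minimal illustration, not the paper's procedure: it assumes you have, per prompt, several sampled responses with a sequence log-probability and a correctness label, and it uses a plain Pearson correlation between the two (the statistic the authors used is not stated here). The names `pearson`, `within_prompt_correlations`, and the sample data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns None when either input has zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # e.g. every draw for this prompt was correct
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def within_prompt_correlations(samples):
    """samples maps prompt_id -> list of (sequence_logprob, is_correct)."""
    result = {}
    for prompt_id, draws in samples.items():
        logprobs = [lp for lp, _ in draws]
        labels = [1.0 if ok else 0.0 for _, ok in draws]
        result[prompt_id] = pearson(logprobs, labels)
    return result


# Made-up data for illustration only.
samples = {
    "p1": [(-12.0, True), (-15.0, False), (-20.0, False)],
    "p2": [(-8.0, True), (-9.0, True)],
}
corrs = within_prompt_correlations(samples)
```

Prompts where every draw is correct (or every draw is wrong) yield `None`, since correlation is undefined there. If most per-prompt values scatter around zero on your task, picking the highest log-probability draw is unlikely to help.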

This breaks probability-based best-of-N reranking, standard in production pipelines. One exception: models already highly accurate on a task do show within-sample correlation. Failures concentrate in mid-accuracy regimes where fallback and routing logic matters most.

Confidence-gated routing also fails. When a system escalates to a costlier model or a human because the primary model's response scores below a log-probability threshold, the threshold ends up separating responses by length, format, or tokenization artifacts rather than by how likely they are to be correct. Sequence-level log-probability is a poor runtime confidence signal.
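The length artifact is easy to see with arithmetic. In the sketch below, two hypothetical responses have identical per-token probability (0.9) but different lengths; a fixed threshold on the summed log-probability escalates only the longer one. The threshold value and function names are assumptions for illustration, and length normalization appears only to expose the artifact, not as a fix for the broader finding.

```python
import math


def sequence_logprob(token_logprobs):
    """Sum of per-token log-probabilities (the usual sequence-level score)."""
    return sum(token_logprobs)


def mean_token_logprob(token_logprobs):
    """Length-normalized score, shown only for comparison."""
    return sum(token_logprobs) / len(token_logprobs)


THRESHOLD = -2.0  # hypothetical escalation threshold


def escalate(token_logprobs, threshold=THRESHOLD):
    """Route to the costlier path when the sequence score falls below threshold."""
    return sequence_logprob(token_logprobs) < threshold


# Same per-token probability, different lengths.
short_response = [math.log(0.9)] * 5
long_response = [math.log(0.9)] * 50

short_escalated = escalate(short_response)
long_escalated = escalate(long_response)
```

Here the short response scores about -0.53 and stays on the primary path, while the long one scores about -5.27 and is escalated, although the model was equally sure of every token in both.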

Self-consistency voting works and is unaffected. Majority voting requires correct answers to cluster, not individual responses to be high-probability. What breaks is probability as a selection criterion: picking one response from candidates, deciding when to stop generation, or routing between models.
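A minimal sketch of the difference, assuming each candidate has already been reduced to a normalized final answer string plus a sequence log-probability. The helper names and the candidate data are made up for illustration.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer and its share of the votes."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


def pick_by_logprob(candidates):
    """Return the answer of the single highest log-probability candidate."""
    return max(candidates, key=lambda c: c[1])[0]


# (final_answer, sequence_logprob) pairs; illustrative values only.
candidates = [("42", -14.2), ("42", -15.1), ("41", -9.8), ("42", -13.7)]

voted_answer, vote_share = majority_vote([answer for answer, _ in candidates])
top_logprob_answer = pick_by_logprob(candidates)
```

In this toy case the vote returns "42" with a 0.75 share, while log-probability selection returns the outlier "41". Voting never looks at the log-probabilities at all, which is why it is unaffected when they stop tracking correctness.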

The failure is structural. RLHF-aligned models emit 80–100% confidence both in domains they know well and in domains they do not, decoupling stated confidence from epistemic state. Sequence probability shows the same within-sample breakdown. The two mechanisms reinforce each other: alignment degrades verbalized calibration, and the probability-correctness mismatch degrades implicit calibration.

FIG. 02 RLHF models emit high confidence (80–100%) even on knowledge-intensive and reasoning tasks where actual accuracy is much lower, yielding Expected Calibration Error (ECE) ≥ 0.30. — Max Planck Institute, 2026
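For reference, Expected Calibration Error is the bin-weighted average gap between stated confidence and observed accuracy. The sketch below uses the common equal-width binning variant with 10 bins, which is an assumption; the cited source does not say which variant it used. The example data is invented to show how an ECE of 0.30 arises.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins over [0, 1]."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that conf == 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if ok else 0.0))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Illustrative only: a model that always says 90% but is right 60% of the time.
confs = [0.9] * 10
labels = [True] * 6 + [False] * 4
ece = expected_calibration_error(confs, labels)
```

With every response at 0.9 confidence and 60% accuracy, the single occupied bin contributes |0.9 - 0.6|, so the ECE is approximately 0.30.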

Recommendation: drop sequence log-probability as a runtime correctness signal. Use it only to measure cross-prompt difficulty in eval. Replace production reranking and routing thresholds with a trained verifier or external reward model validated on your task distribution.
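In code, the swap amounts to changing the scoring function used for reranking. The sketch below is a hedged illustration: `rerank_with_verifier` takes any callable that scores a (prompt, response) pair, and `toy_arithmetic_verifier` is a hypothetical stand-in that only handles prompts of the form "a+b". A real deployment would plug in a trained verifier or reward model validated on its own task distribution.

```python
from typing import Callable, Sequence


def rerank_with_verifier(
    prompt: str,
    candidates: Sequence[str],
    verifier: Callable[[str, str], float],
) -> str:
    """Return the candidate the verifier scores highest (first one on ties)."""
    return max(candidates, key=lambda response: verifier(prompt, response))


def toy_arithmetic_verifier(prompt: str, response: str) -> float:
    """Hypothetical verifier: 1.0 if the response equals the sum in 'a+b'."""
    left, right = prompt.split("+")
    expected = str(int(left) + int(right))
    return 1.0 if response.strip() == expected else 0.0


# Candidates as they might come back from sampling; values are made up.
best = rerank_with_verifier("2+3", ["6", "5", "23"], toy_arithmetic_verifier)
```

Keeping the verifier behind a plain callable means the same reranking and routing code can be re-validated against a different scorer without touching the pipeline.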

Sources

Study covers 88 decoding methods (22 targeting the power distribution, 22 targeting the mode, 44 local methods), 14 models (Qwen2.5, Qwen3, OLMo3 families), and 6 benchmark datasets
"we quantify the relationship between sequence probability and correctness across 88 decoding methods (22 methods targeting the power distribution, 22 methods targeting the mode of the distribution, and 44 local methods), 14 models (from the Qwen2.5, Qwen3, Olmo3 families), and 6 benchmark datasets"
arxiv.org ↗
Within-dataset correlation is consistent and depends on model family, not method
"We find a consistent correlation within a dataset depending on the model family but not the method"
arxiv.org ↗
Tuning hyperparameters to produce higher-probability sequences does not increase accuracy
"tuning the hyperparameters of a decoding method, while producing sequences of higher log-probability, does not result in more correct sequences"
arxiv.org ↗
Methods producing higher-probability sequences are not consistently more accurate
"methods that produce higher-probability sequences are not consistently more accurate"
arxiv.org ↗
For a single prompt, there is no consistent correlation between log-probability and correctness across repeated responses
"For a single prompt, there is no consistent correlation within the corresponding responses"
arxiv.org ↗
More correct samples show larger within-sample correlations — the exception for high-accuracy models
"more correct samples also show larger within-sample correlations"
arxiv.org ↗
Paper provides practical guidance for decoding, self-consistency, and verifier-free self-improvement
"These findings clarify when decoding can and cannot be expected to improve correctness, and provide practical guidance for decoding, self-consistency, and verifier-free self-improvement"
arxiv.org ↗
RLHF-aligned models emit verbalized confidence scores between 80–100%, with ECE values reaching 0.30 or higher on knowledge-intensive tasks
"large RLHF-tuned models primarily emit verbalized confidence scores between 80% and 100%, with ECE (Expected Calibration Error) values that can reach 0.30 or higher on knowledge-intensive tasks"
zylos.ai ↗

Written and edited by AI agents · Methodology
