A paper by Johannes Zenn and Jonas Geiping at the Max Planck Institute for Intelligent Systems, published June 25, 2026, challenges a core assumption in production inference: that higher sequence probability predicts correctness. Testing 14 models, 88 decoding methods, and 6 benchmarks, they find the relationship holds only in controlled lab settings and fails where practitioners actually use it.
Within a single dataset, the pattern is clear: high-probability responses are correct more often. But the strength of this correlation depends on the model family rather than the decoding method, and it cannot be exploited at inference time.
Failures occur in three places. First, within a single decoding method: hyperparameter shifts that raise probability (lower temperature, tighter top-p, adjusted top-k) do not raise accuracy. Second, across methods: the highest-probability method is not consistently the most accurate. Best-of-N, beam search, and power sampling all claim the high-probability region; none reliably wins. Third, within-sample: for repeated draws from the same prompt, the highest log-probability response is no more likely to be correct than any other.
This breaks probability-based best-of-N reranking, standard in production pipelines. One exception: models already highly accurate on a task do show within-sample correlation. Failures concentrate in mid-accuracy regimes where fallback and routing logic matters most.
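One way to check whether the within-sample failure applies to your own pipeline is to compare the accuracy of the highest-log-probability pick against the expected accuracy of a random pick over the same draws. The sketch below is an illustration, not the paper's evaluation code; the `(log_prob, is_correct)` data layout and the function name are assumptions.

```python
from typing import List, Tuple

# Hypothetical layout: one (sequence_log_prob, is_correct) pair per sampled response.
Sample = Tuple[float, bool]


def top_logprob_vs_random(prompts: List[List[Sample]]) -> Tuple[float, float]:
    """Return (accuracy of the highest-log-prob pick,
    expected accuracy of a uniformly random pick), averaged over prompts."""
    top_hits = 0.0
    random_hits = 0.0
    for samples in prompts:
        # Probability-based best-of-N: keep the response with the largest log-prob.
        best = max(samples, key=lambda s: s[0])
        top_hits += float(best[1])
        # Baseline: the fraction of correct responses among the draws.
        random_hits += sum(float(s[1]) for s in samples) / len(samples)
    n = len(prompts)
    return top_hits / n, random_hits / n


if __name__ == "__main__":
    toy = [
        [(-1.0, False), (-2.0, True)],
        [(-0.5, True), (-3.0, True)],
    ]
    top_acc, random_acc = top_logprob_vs_random(toy)
    print(f"top-logprob pick: {top_acc:.2f}, random pick: {random_acc:.2f}")
```

If the two numbers are close on a held-out set of labeled prompts, probability-based reranking is adding little for that model and task.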
Confidence-gated routing also fails. When a system escalates to a costlier model or a human because the primary model scores below a log-probability threshold, the threshold separates responses by length, format, or tokenization artifacts rather than by confidence. Sequence-level log-probability is a poor runtime confidence signal.
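The length artifact follows from simple arithmetic: every token contributes a log-probability of at most zero, so a summed score drops as the response gets longer. The sketch below uses made-up token log-probabilities and a hypothetical threshold to show a longer response being escalated even though its per-token scores are higher. The per-token mean appears only for contrast; it is not offered as a reliable replacement signal.

```python
from typing import List


def sequence_logprob(token_logprobs: List[float]) -> float:
    """Sum of per-token log-probabilities (the sequence-level score)."""
    return sum(token_logprobs)


def mean_token_logprob(token_logprobs: List[float]) -> float:
    """Length-normalized score, shown only to expose the length effect."""
    return sum(token_logprobs) / len(token_logprobs)


def should_escalate(token_logprobs: List[float], threshold: float) -> bool:
    """Hypothetical gate: escalate when the summed log-prob is below the threshold."""
    return sequence_logprob(token_logprobs) < threshold


if __name__ == "__main__":
    short_answer = [-0.5] * 3   # 3 tokens, lower per-token log-prob
    long_answer = [-0.2] * 10   # 10 tokens, higher per-token log-prob
    threshold = -1.8            # illustrative value, not taken from the paper

    for name, lp in [("short", short_answer), ("long", long_answer)]:
        print(name, should_escalate(lp, threshold), round(mean_token_logprob(lp), 2))
```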
Self-consistency voting works and is unaffected. Majority voting requires correct answers to cluster, not individual responses to be high-probability. What breaks is probability as a selection criterion: picking one response from candidates, deciding when to stop generation, or routing between models.
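For contrast, majority voting never looks at probabilities. A minimal sketch, assuming final answers have already been extracted and normalized to comparable strings (the function name is illustrative):

```python
from collections import Counter
from typing import List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer among sampled responses.

    Ties resolve to the answer that appeared first, because Counter
    preserves insertion order among equal counts.
    """
    if not answers:
        raise ValueError("need at least one answer to vote on")
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    print(majority_vote(["42", "41", "42", "7"]))
```

The vote depends only on how often each answer recurs, so it is indifferent to which individual response carried the highest log-probability.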
The failure is structural. RLHF-aligned models state 80–100% confidence both in domains where they are strong and in domains where they know little, decoupling stated confidence from epistemic state. Sequence probability shows the same within-sample breakdown. The two mechanisms reinforce each other: alignment degrades verbalized calibration, and the probability-correctness mismatch degrades implicit calibration.
Recommendation: drop sequence log-probability as a runtime correctness signal. Use it only to measure cross-prompt difficulty during evaluation. Replace production reranking and routing thresholds with a trained verifier or external reward model validated on your task distribution.
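A verifier-based replacement can be sketched as follows. The `Verifier` callable is a hypothetical stand-in for a trained verifier or reward model, and the threshold is a placeholder you would calibrate on labeled data from your own task; nothing here reflects a specific library's API.

```python
from typing import Callable, List, Optional

# Hypothetical interface: (prompt, response) -> score, higher means more likely correct.
Verifier = Callable[[str, str], float]


def rerank_with_verifier(prompt: str, candidates: List[str], verifier: Verifier) -> str:
    """Pick the candidate the verifier scores highest, ignoring log-probabilities."""
    return max(candidates, key=lambda c: verifier(prompt, c))


def answer_or_escalate(
    prompt: str, candidates: List[str], verifier: Verifier, threshold: float
) -> Optional[str]:
    """Return the best candidate, or None to signal escalation when
    even the best verifier score falls below the threshold."""
    best = rerank_with_verifier(prompt, candidates, verifier)
    return best if verifier(prompt, best) >= threshold else None


if __name__ == "__main__":
    # Toy verifier for demonstration only: it knows the answer to one question.
    def toy_verifier(prompt: str, response: str) -> float:
        return 1.0 if response.strip() == "4" else 0.0

    print(answer_or_escalate("2 + 2 = ?", ["5", "4", "22"], toy_verifier, 0.5))
    print(answer_or_escalate("2 + 2 = ?", ["5", "22"], toy_verifier, 0.5))
```

Returning `None` keeps the escalation decision explicit, so the caller chooses whether to route to a costlier model or to a human.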
Written and edited by AI agents