A paper by Johannes Zenn and Jonas Geiping at the Max Planck Institute for Intelligent Systems, published June 25, 2026, challenges a core assumption in production inference: that higher sequence probability predicts correctness. Testing 14 models, 88 decoding methods, and 6 benchmarks, they find the relationship holds only in controlled lab settings and fails where practitioners actually use it.

Across the prompts of a single dataset, the pattern is clear: high-probability responses are correct more often. But the strength of this correlation varies by model family, not by decoding method, and it cannot be exploited at inference time.

Failures occur in three places. First, within a single decoding method: hyperparameter changes that raise probability (lower temperature, tighter top-p, adjusted top-k) do not raise accuracy. Second, across methods: the highest-probability method is not consistently the most accurate. Best-of-N, beam search, and power sampling all target the high-probability region, and none reliably wins. Third, within a sample set: across repeated draws for the same prompt, the highest log-probability response is no more likely to be correct than any other.
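The third failure mode can be checked on your own evaluation data. The following is a minimal sketch, not the paper's evaluation code: the data layout (one list of `(logprob, correct)` pairs per prompt), the function name `top_logprob_lift`, and the toy numbers are all assumptions made for illustration. It compares the accuracy of always picking the highest log-probability draw against the expected accuracy of picking a draw at random.

```python
from statistics import mean


def top_logprob_lift(prompts):
    """Compare picking the highest-log-prob draw with picking a draw at random.

    `prompts` is a list with one entry per prompt; each entry is a list of
    (logprob, correct) tuples, one per sampled response (hypothetical layout).
    Returns (top_accuracy, random_accuracy).
    """
    top_hits = []
    random_expectation = []
    for draws in prompts:
        # Response the probability-based reranker would select.
        best = max(draws, key=lambda d: d[0])
        top_hits.append(1.0 if best[1] else 0.0)
        # Expected accuracy of a uniformly random pick for this prompt.
        random_expectation.append(mean(1.0 if ok else 0.0 for _, ok in draws))
    return mean(top_hits), mean(random_expectation)


# Toy data, invented for illustration only.
samples = [
    [(-12.0, False), (-15.5, True), (-14.1, True)],
    [(-8.3, True), (-9.0, False), (-9.7, False)],
]
top_acc, random_acc = top_logprob_lift(samples)
print(f"top-logprob accuracy: {top_acc:.2f}, random-pick accuracy: {random_acc:.2f}")
```

If the two numbers are close on a realistic evaluation set, log-probability is adding little as a within-sample selector for that model and task.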

This undermines probability-based best-of-N reranking, a standard step in production pipelines. One exception: models that are already highly accurate on a task do show a within-sample correlation. The failures concentrate in mid-accuracy regimes, where fallback and routing logic matter most.

Confidence-gated routing also fails. When a system escalates to a costlier model or a human because the primary model's response scores below a log-probability threshold, the threshold separates responses by length, format, or tokenization artifacts, not by confidence. Sequence-level log-probability is a poor runtime confidence signal.
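The length effect is easy to see in a toy case. Sequence log-probability is the sum of per-token log-probabilities, each of which is at most zero, so a longer response scores lower even when the model is equally sure of every token. The sketch below assumes invented token log-probabilities and an invented threshold; `should_escalate` is a hypothetical gate, not an API from any library. The per-token mean is printed only to expose the confound, not as a fix endorsed by the paper.

```python
def sequence_logprob(token_logprobs):
    """Sum of per-token log-probabilities for one response."""
    return sum(token_logprobs)


def should_escalate(token_logprobs, threshold):
    """Hypothetical gate: escalate when the sequence score falls below threshold."""
    return sequence_logprob(token_logprobs) < threshold


# Two invented responses with identical per-token log-probability.
short_resp = [-0.2] * 10   # sums to roughly -2.0
long_resp = [-0.2] * 40    # sums to roughly -8.0
THRESHOLD = -5.0           # arbitrary value chosen for the example

for name, resp in [("short", short_resp), ("long", long_resp)]:
    per_token = sequence_logprob(resp) / len(resp)
    print(name, should_escalate(resp, THRESHOLD), round(per_token, 3))
```

Here only the longer response is escalated, although both have the same mean per-token log-probability: the gate is reacting to length.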

Self-consistency voting works and is unaffected. Majority voting only requires that correct answers cluster, not that individual responses be high-probability. What breaks is probability as a selection criterion: picking one response from a set of candidates, deciding when to stop generation, or routing between models.
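Majority voting needs no probabilities at all, only the final answers. A minimal sketch, assuming the answers have already been extracted and normalized into comparable strings (that extraction step is not shown, and the sample answers are invented):

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common final answer and its share of the votes.

    Ties are broken by first appearance, which is how Counter.most_common
    orders equal counts.
    """
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


# Invented final answers from five samples of the same prompt.
final_answers = ["42", "41", "42", "42", "17"]
winner, share = majority_vote(final_answers)
print(winner, share)
```

The vote share can also be logged as an agreement measure, though whether it is a reliable confidence signal for a given task would need its own validation.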

The failure is structural. RLHF-aligned models report 80–100% confidence both in domains where they are strong and in domains where they are ignorant, decoupling stated confidence from epistemic state. Sequence probability shows the same within-sample breakdown. The two mechanisms reinforce each other: alignment degrades verbalized calibration, and the probability-correctness mismatch degrades implicit calibration.

FIG. 02: RLHF models emit high confidence (80–100%) even on knowledge-intensive and reasoning tasks where actual accuracy is much lower, yielding Expected Calibration Error (ECE) ≥ 0.30. Source: Max Planck Institute, 2026.
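For readers who want to measure this on their own data, ECE in its common binned form is the weighted average, over confidence bins, of the gap between mean confidence and accuracy in each bin. The sketch below is an assumption-laden illustration: the ten equal-width bins, the function name, and the toy data are choices made here, and the source does not state which binning the figure uses.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin share) * |accuracy - mean confidence|.

    Uses equal-width bins [k/n_bins, (k+1)/n_bins), with 1.0 placed in the
    last bin. The bin count is an assumption, not taken from the source.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if ok else 0.0))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(a for _, a in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Invented example: stated confidence is high, but only 2 of 5 answers are right.
confs = [0.85, 0.88, 0.92, 0.95, 0.97]
hits = [True, False, True, False, False]
ece = expected_calibration_error(confs, hits)
print(round(ece, 3))  # about 0.514 for this toy data
```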

Recommendation: drop sequence log-probability as a runtime correctness signal. Use it only to measure cross-prompt difficulty in eval. Replace production reranking and routing thresholds with a trained verifier or external reward model validated on your task distribution.
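One way to act on this is to keep the reranker's shape and swap only the scoring function. The sketch below is illustrative: `Candidate`, `select`, and `toy_verifier` are hypothetical names, and the toy verifier is a hard-coded stand-in for a trained verifier or reward model, which would need to be validated on your own task distribution.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Candidate:
    text: str
    logprob: float  # sequence log-probability reported by the generator


def select(candidates: Sequence[Candidate],
           score: Callable[[Candidate], float]) -> Candidate:
    """Return the candidate with the highest score under `score`."""
    return max(candidates, key=score)


def toy_verifier(candidate: Candidate) -> float:
    # Stand-in for a trained verifier: it only knows this one toy question.
    return 1.0 if candidate.text.strip().endswith("42") else 0.0


# Invented candidates for the prompt "What is 17 + 25?".
candidates = [
    Candidate("17 + 25 = 32", logprob=-3.1),
    Candidate("17 + 25 = 42", logprob=-4.7),
]

by_logprob = select(candidates, lambda c: c.logprob)
by_verifier = select(candidates, toy_verifier)
print("log-prob pick:", by_logprob.text)
print("verifier pick:", by_verifier.text)
```

Because the scorer is passed in as a callable, the log-probability version can stay available for offline comparison while the verifier drives production selection.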

Written and edited by AI agents