A single greedy decoding pass detects LLM hallucinations as reliably as expensive multi-sample consensus methods, according to new research from Mina Gabriel published on arXiv this week. The finding directly challenges the assumption that hallucination detection requires repeated inference calls — a costly default in production pipelines.
The method, called phi_first, computes the normalized entropy of the top-K logits at the first content-bearing answer token during a standard greedy decode. No additional sampling, no external classifier, no natural language inference clustering step.
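A minimal sketch of that computation, in Python, is below. The helper name, the K = 10 default, and the log(K) normalization are illustrative assumptions; the paper's exact choices may differ.

```python
import numpy as np

def phi_first_score(first_token_logits: np.ndarray, k: int = 10) -> float:
    """Normalized entropy of the top-K logits at the first content-bearing
    answer token. Higher values indicate greater uncertainty. The function
    name, K default, and log(K) normalization are illustrative assumptions."""
    # Keep only the K largest logits and renormalize them into a distribution.
    top_k = np.sort(first_token_logits)[-k:]
    top_k = top_k - top_k.max()                      # numerical stability
    probs = np.exp(top_k) / np.exp(top_k).sum()
    entropy = -(probs * np.log(probs)).sum()
    # Dividing by log(K) bounds the score in [0, 1] regardless of K.
    return float(entropy / np.log(k))
```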
Across three 7–8B instruction-tuned models and two closed-book short-answer factual benchmarks, phi_first achieved a mean AUROC of 0.820. Standard surface-form self-consistency — which generates multiple sampled answers and measures lexical agreement — scored 0.791. Semantic self-consistency, which clusters sampled answers by meaning using natural language inference to handle lexical variation, scored 0.793. The single-pass metric matched or exceeded both multi-sample approaches.
Combining phi_first with semantic agreement in an ensemble yielded only a marginal AUROC improvement over phi_first alone. Logit entropy and agreement-across-samples are moderately to strongly correlated, meaning most of the uncertainty information in agreement signals is already present at the first token. The multi-sample overhead buys very little additional discrimination.
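One way such an ensemble could be assembled is a simple weighted combination of the two signals after rescaling; the paper's actual ensembling method is not specified here, so the equal weighting and min-max scaling below are placeholders.

```python
import numpy as np

def ensemble_score(entropy_scores: np.ndarray,
                   disagreement_scores: np.ndarray,
                   weight: float = 0.5) -> np.ndarray:
    """Weighted combination of first-token entropy and (1 - semantic agreement).
    Both inputs are assumed oriented so that higher means more likely
    hallucination. The 0.5 weight is a placeholder, not a value from the paper."""
    def minmax(x: np.ndarray) -> np.ndarray:
        return (x - x.min()) / (x.max() - x.min() + 1e-12)
    return weight * minmax(entropy_scores) + (1 - weight) * minmax(disagreement_scores)
```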
For enterprise teams running customer-facing LLM pipelines at scale, the cost implication is direct. Self-consistency-based hallucination detection typically multiplies inference compute by the number of samples drawn — often five to ten — before any answer is returned. Semantic self-consistency adds a separate NLI inference pass on top of that. Replacing both with a logit-entropy read on an already-scheduled greedy decode eliminates that multiplier entirely and removes the latency spike inherent to parallel sampling. At high throughput, this difference determines whether hallucination checking is economically feasible in the request path or must be deferred to async auditing.
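A back-of-the-envelope comparison makes the gap concrete. The cost constants below are hypothetical and expressed in units of one greedy decode; they are not figures from the paper.

```python
def per_request_cost(n_samples: int, decode_cost: float = 1.0,
                     nli_cluster_cost: float = 0.5) -> dict:
    """Rough request-path compute per answer, in units of one greedy decode.
    Both cost constants are hypothetical figures for illustration only."""
    return {
        "phi_first": decode_cost,                     # the already-scheduled greedy pass
        "self_consistency": n_samples * decode_cost,  # N sampled decodes before answering
        "semantic_self_consistency": n_samples * decode_cost + nli_cluster_cost,
    }

# Example: with five samples, consensus checks cost roughly 5x the single-pass read.
print(per_request_cost(n_samples=5))
```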
The approach also simplifies system architecture. Multi-sample consensus requires coordinating parallel generation requests, aggregating outputs, and either running an NLI model or executing string-matching logic. phi_first integrates into any serving stack that exposes per-token logit distributions — a capability already present in vLLM, TGI, and TensorRT-LLM. There are no new model weights, no fine-tuning, and no additional model endpoints to manage.
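For a vLLM-based stack, the read might look like the sketch below. The model name, K = 10, and the reuse of the phi_first_score helper from earlier are assumptions, and indexing the very first generated token is a simplification of "first content-bearing token", which in practice may require skipping whitespace or formatting tokens.

```python
import numpy as np
from vllm import LLM, SamplingParams

# One greedy decode that also returns top-K log-probabilities per token,
# assuming a recent vLLM version; model name and K=10 are illustrative.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
params = SamplingParams(temperature=0.0, max_tokens=64, logprobs=10)

out = llm.generate(["Q: What year did the Berlin Wall fall?\nA:"], params)[0]
# Per-token dicts of {token_id: Logprob}; index 0 stands in for the first
# content-bearing answer token.
first_token = out.outputs[0].logprobs[0]
logprob_values = np.array([lp.logprob for lp in first_token.values()])

# Softmax is shift-invariant, so renormalized top-K logprobs behave the same
# as renormalized top-K logits and the earlier helper applies unchanged.
score = phi_first_score(logprob_values, k=len(logprob_values))
print(f"first-token normalized entropy: {score:.3f}")
```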
Validation is limited to closed-book factual QA with 7–8B parameter models. Whether the signal holds for longer-form generation, larger model classes, or retrieval-augmented tasks remains unvalidated. The paper does not test phi_first on chain-of-thought outputs or code generation, where the first content-bearing token is less semantically loaded. Calibration across domains with varying base rates of hallucination is also an open question.
Gabriel argues that phi_first should be reported as the default low-cost baseline in any uncertainty estimation study before moving to sampling-based methods. For practitioners: before deploying a multi-sample consensus check in a latency-sensitive pipeline, measure what first-token entropy buys. In most factual QA settings, the answer is nearly everything.