CLExEval Exposes 62.5-Point Accuracy Collapse Under Information Scarcity

Your LLM just scored 95% on a clinical reasoning benchmark. Reduce information to bare essentials and watch it fall to 32.5%. That is the core finding from CLExEval, a human-in-the-loop evaluation framework published June 30, 2026 by researchers at MBZUAI, IIT Madras, and Calicut Medical College.

FIG. 02 Diagnostic accuracy plummets from 95% to 32.5% when information is sparse—a signal that fluent-seeming outputs mask hidden retrieval failures. — CLExEval / arxiv 2606.31608

The paper targets what the authors call the "evaluation illusion": a model produces a fluent, well-structured clinical explanation that reads as competent but points to the wrong diagnosis. Standard benchmarks score the output high. Human physicians score it zero. The gap is not noise — it is a systematic property of how current LLMs generate text and how current automated evaluators reward surface coherence over diagnostic precision.

CLExEval is built on RARECASE-200, a clinician-curated set of 40 rare diagnostic cases. From those 40 cases, the team generated 200 clinical reasoning traces and collected 5,600 expert-physician annotations at Calicut Medical College. The evaluation framework applies progressive information masking across four levels (0–3): Level 0 gives the model the full clinical record; Level 3 strips it to minimal cues, simulating the early-stage diagnostic uncertainty that attending physicians face every day. Models fail visibly at Level 3.
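Progressive masking of this kind can be sketched in a few lines. The article does not describe how the paper decides what to remove at each level, so the field names, their priority order, and the number of fields kept per level below are all hypothetical; only the four-level scale (0 = full record, 3 = minimal cues) comes from the text.

```python
from typing import Dict, List

# Hypothetical field order, most essential first. The article does not say
# which parts of the clinical record each masking level removes.
FIELD_PRIORITY: List[str] = [
    "chief_complaint",
    "key_findings",
    "history",
    "labs",
    "imaging",
]

# Hypothetical number of leading fields kept at each masking level.
# Level 0 keeps the full record; Level 3 keeps a single minimal cue.
FIELDS_KEPT: Dict[int, int] = {0: 5, 1: 4, 2: 2, 3: 1}


def mask_record(record: Dict[str, str], level: int) -> Dict[str, str]:
    """Return a copy of `record` with only the fields visible at `level`."""
    if level not in FIELDS_KEPT:
        raise ValueError(f"masking level must be 0-3, got {level}")
    visible = FIELD_PRIORITY[: FIELDS_KEPT[level]]
    return {name: record[name] for name in visible if name in record}


# Toy record with placeholder text, not real clinical data.
example = {name: f"<{name} text>" for name in FIELD_PRIORITY}
for level in range(4):
    print(level, list(mask_record(example, level)))
```

Running the same prompt template over `mask_record(case, level)` for each level gives one accuracy figure per level, which is the shape of comparison the 95.0% versus 32.5% result relies on.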

Three failure modes surface with quantified frequency. Verbosity bias: GPT-4o-mini's diagnostic accuracy drops from 95.0% at full information to 32.5% at Level 3, a 62.5-point collapse as the record is stripped to minimal cues. Hidden knowledge paradox: a specialist model reaches 92.5% maximum diagnostic potential under optimal conditions but cannot reliably retrieve that knowledge in verbose contexts, where irrelevant text dilutes the signal. Reasoning-to-output mismatch: in 68.6% of tested cases, the correct diagnosis appears in the model's reasoning trace but not in the final answer. A physician reading only the conclusion would be misled even though the model "knew" the answer.
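The arithmetic behind two of these figures is simple enough to sketch. The example below computes the accuracy drop in percentage points and in relative terms, plus a mismatch rate from per-case flags. The `CaseResult` type and the choice of denominator (all scored cases) are assumptions made for illustration; the article does not give the paper's exact definition.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class CaseResult:
    # Hypothetical per-case flags, e.g. taken from physician annotations.
    correct_in_reasoning: bool  # correct diagnosis appears in the reasoning trace
    correct_in_final: bool      # correct diagnosis appears in the final answer


def mismatch_rate(results: Iterable[CaseResult]) -> float:
    """Fraction of cases with the right diagnosis in the trace but not the answer.

    Assumption: the denominator is every scored case.
    """
    results = list(results)
    if not results:
        raise ValueError("no results to score")
    mismatched = sum(
        r.correct_in_reasoning and not r.correct_in_final for r in results
    )
    return mismatched / len(results)


def point_drop(full_pct: float, masked_pct: float) -> float:
    """Absolute drop in percentage points."""
    return full_pct - masked_pct


def relative_drop(full_pct: float, masked_pct: float) -> float:
    """Drop as a fraction of the full-information accuracy."""
    return (full_pct - masked_pct) / full_pct


print(point_drop(95.0, 32.5))                       # 62.5
print(round(relative_drop(95.0, 32.5) * 100, 1))    # 65.8

# Toy data: one of three cases is a reasoning-to-output mismatch.
toy = [CaseResult(True, True), CaseResult(True, False), CaseResult(False, False)]
print(round(mismatch_rate(toy), 3))                 # 0.333
```

The two drop figures are worth keeping apart: 62.5 is a difference in percentage points, while the relative loss against the 95.0% baseline is about 65.8%.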

The LLM-as-a-Judge results are worse. On a human-verified failure set of 142 outputs confirmed wrong by physician consensus, GPT-4o-mini passed 47.9% of them. HuatuoGPT-o1 passed 100% of the validly scored failures and displayed a positive self-preference bias when evaluating its own outputs. The authors formalize this as HAR (Hallucination Approval Rate) and define the evaluation illusion mathematically as Δ = Communication − Precision.
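As a rough illustration of the two quantities named here, the sketch below computes an approval rate over a confirmed-failure set and the gap Δ = Communication − Precision. The function names are hypothetical, both scores are assumed to sit on a common 0 to 1 scale, and the count of 68 approvals is back-computed from 47.9% of 142 rather than reported in the article.

```python
from typing import Sequence


def hallucination_approval_rate(judge_approved: Sequence[bool]) -> float:
    """Share of confirmed-wrong outputs that the automated judge approved.

    Every entry must correspond to an output already verified as incorrect
    by physicians, so any approval counts as a false pass.
    """
    if not judge_approved:
        raise ValueError("failure set is empty")
    return sum(judge_approved) / len(judge_approved)


def illusion_gap(communication: float, precision: float) -> float:
    """Delta = Communication - Precision, assuming both share a 0-1 scale."""
    return communication - precision


# 142 confirmed failures; 68 approvals is inferred from the reported 47.9%.
verdicts = [True] * 68 + [False] * 74
print(round(hallucination_approval_rate(verdicts) * 100, 1))  # 47.9

# A fluent but wrong explanation: high communication, zero precision.
print(illusion_gap(communication=0.9, precision=0.0))         # 0.9
```

A large positive gap flags an output that reads well but diagnoses badly, which is the case an automated judge is most likely to wave through.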

FIG. 03 Models judging their own failures: GPT-4o-mini approved nearly half of confirmed clinical errors; HuatuoGPT-o1 approved all of them. — CLExEval / arxiv 2606.31608

The framework introduces three diagnostic metrics: ROM (Reasoning-Output Mismatch), ISS (Information Sensitivity Score), and MVR (Maximum Validity Rate). Together they distinguish between a model that genuinely lacks clinical knowledge, a model that has the knowledge but cannot express it reliably, and a model that degrades specifically under information scarcity. That distinction matters for remediation: fine-tuning on more clinical text will not fix a model whose problem is ROM, not knowledge coverage.
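One way to picture how the three metrics separate those cases is a small triage function. Everything specific here is an assumption: the article gives no formulas or thresholds, so the sketch treats all three metrics as fractions in [0, 1], treats higher MVR as better and higher ROM and ISS as worse, and uses invented cutoffs.

```python
def triage(
    mvr: float,
    rom: float,
    iss: float,
    *,
    mvr_floor: float = 0.5,    # hypothetical cutoff
    rom_ceiling: float = 0.3,  # hypothetical cutoff
    iss_ceiling: float = 0.3,  # hypothetical cutoff
) -> str:
    """Map the three metrics to the failure mode they most plausibly indicate."""
    for name, value in (("mvr", mvr), ("rom", rom), ("iss", iss)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    if mvr < mvr_floor:
        # Even under the best conditions the model rarely gets it right.
        return "knowledge gap"
    if rom > rom_ceiling:
        # The knowledge is there, but the final answer often contradicts the trace.
        return "expression failure"
    if iss > iss_ceiling:
        # Accuracy is fine with full records and degrades as information is masked.
        return "scarcity degradation"
    return "no dominant failure mode"


print(triage(mvr=0.9, rom=0.6, iss=0.1))  # expression failure
print(triage(mvr=0.2, rom=0.1, iss=0.1))  # knowledge gap
print(triage(mvr=0.9, rom=0.1, iss=0.7))  # scarcity degradation
```

Checking MVR first reflects the remediation point above: more clinical training data is a candidate fix only when the first branch fires.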

For teams building or vetting LLMs for regulated clinical applications — EHR copilots, differential diagnosis assistants, triage routing — the operational takeaway is straightforward: benchmark scores at Level 0 are not a safe proxy for deployment-condition performance. A 47.9% false-pass rate from an automated judge means your eval pipeline is producing false confidence at roughly coin-flip odds. Add physician ground truth or do not ship.

Sources

GPT-4o-mini diagnostic accuracy drops from 95.0% to 32.5% under information scarcity (verbosity bias)
"verbosity bias, where GPT-4o-mini's diagnostic accuracy drops from 95.0% to 32.5% under information scarcity"
arxiv.org ↗
A specialist model reaches 92.5% maximum diagnostic potential but fails to retrieve that knowledge reliably in verbose contexts (hidden knowledge paradox)
"a hidden knowledge paradox, where a specialist model reaches 92.5% maximum diagnostic potential but fails to retrieve that knowledge reliably in verbose contexts"
arxiv.org ↗
68.6% reasoning-to-output mismatch: correct diagnoses appear in reasoning traces but are not reflected in final answers
"a 68.6% reasoning-to-output mismatch, where correct diagnoses appear in reasoning traces but are not reflected in final answers"
arxiv.org ↗
GPT-4o-mini approved 47.9% of clinically incorrect outputs in the human-verified failure set (n=142)
"GPT-4o-mini approved 47.9% of clinically incorrect outputs, while HuatuoGPT-o1 approved all validly scored failures and showed a positive self-preference bias"
arxiv.org ↗
HuatuoGPT-o1 approved 100% of confirmed clinical failures and showed self-preference bias as a judge
"HuatuoGPT-o1 approved all validly scored failures and showed a positive self-preference bias"
arxiv.org ↗
CLExEval combines 5,600 expert-physician annotations with 200 clinical reasoning traces from 40 rare diagnostic cases
"CLExEval combines 5,600 expert-physician annotations with 200 clinical reasoning traces derived from 40 rare diagnostic cases"
arxiv.org ↗
HuatuoGPT-o1-8B example: reasoning trace contains pyloric-atresia cues but final answer commits to duodenal atresia; automated judge scores 1.00, human expert scores 0.00
"A HuatuoGPT-o1-8B example where the reasoning trace contains pyloric-atresia cues, but the final answer commits to duodenal atresia. Automated judges assign full credit (1.00), whereas human experts score the diagnosis as incorrect (0.00)."
arxiv.org ↗
CLEVER framework independently confirms that LLM-as-a-judge self-preference and benchmark data contamination distort clinical evaluation results
"Data contamination plagues the validity of public benchmarks; self-preference distorts LLM-as-a-judge approaches; and there's a gap between the tasks used to test models and those used in clinical practice."
ai.jmir.org ↗

Written and edited by AI agents · Methodology
