Your LLM just scored 95% on a clinical reasoning benchmark. Strip the input down to minimal cues and the same model's accuracy falls to 32.5%. That is the core finding from CLExEval, a human-in-the-loop evaluation framework published June 30, 2026 by researchers at MBZUAI, IIT Madras, and Calicut Medical College.
The paper targets what the authors call the "evaluation illusion": a model produces a fluent, well-structured clinical explanation that reads as competent but points to the wrong diagnosis. Standard benchmarks score the output highly. Human physicians score it zero. The gap is not noise. It is a systematic property of how current LLMs generate text and how current automated evaluators reward surface coherence over diagnostic precision.
CLExEval is built on RARECASE-200, a clinician-curated set of 40 rare diagnostic cases. From those 40 cases, the team generated 200 clinical reasoning traces and collected 5,600 expert-physician annotations at Calicut Medical College. The evaluation framework applies progressive information masking across four levels (0–3): Level 0 gives the model the full clinical record; Level 3 strips it to minimal cues, simulating the early-stage diagnostic uncertainty that attending physicians face every day. Models fail visibly at Level 3.
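A minimal sketch of what progressive masking could look like in code. The field names, their priority order, and the number of fields kept at each level are all assumptions made for illustration; the article does not say which parts of the record CLExEval removes at Levels 1 through 3.

```python
from typing import Dict, List

# Hypothetical field order, from most to least essential. The ordering is an
# assumption for illustration, not the paper's masking schedule.
FIELD_PRIORITY: List[str] = [
    "chief_complaint",
    "key_findings",
    "history",
    "lab_results",
    "imaging",
    "clinician_notes",
]

# Assumed number of fields retained at each masking level (0 = full record).
FIELDS_KEPT: Dict[int, int] = {0: 6, 1: 4, 2: 2, 3: 1}


def mask_record(record: Dict[str, str], level: int) -> Dict[str, str]:
    """Return a copy of the record holding only the fields allowed at `level`."""
    if level not in FIELDS_KEPT:
        raise ValueError(f"level must be one of {sorted(FIELDS_KEPT)}")
    allowed = FIELD_PRIORITY[: FIELDS_KEPT[level]]
    return {name: record[name] for name in allowed if name in record}


# Usage: build one placeholder case and show what survives at each level.
example_case = {name: f"<{name} text>" for name in FIELD_PRIORITY}
for level in range(4):
    print(level, list(mask_record(example_case, level)))
```

Running the same model on all four views of each case gives an accuracy-per-level curve, which is the quantity the Level 0 versus Level 3 comparison is about.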
Three failure modes surface with quantified frequency. Verbosity bias: GPT-4o-mini's diagnostic accuracy drops from 95.0% at full information to 32.5% at Level 3, a 62.5-point collapse that the authors attribute to the model's inability to reason under information scarcity. Hidden knowledge paradox: a specialist model reaches 92.5% maximum diagnostic potential when conditions are optimal, but cannot reliably retrieve that knowledge in verbose contexts where irrelevant text dilutes the signal. Reasoning-to-output mismatch: in 68.6% of tested cases, the correct diagnosis appears in the model's reasoning trace but not in its final answer. A physician reading only the conclusion would be misled even though the model "knew" the answer.
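The reasoning-to-output mismatch can be sketched as a simple count over traces. Everything here is an assumption for illustration: the `Trace` fields are hypothetical, the substring match stands in for physician adjudication, and the denominator (all traces) may differ from the one behind the 68.6% figure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trace:
    """One model output. All field names are hypothetical."""
    gold: str       # physician-confirmed diagnosis
    reasoning: str  # the model's reasoning trace
    final: str      # the model's final answer


def mentions(text: str, diagnosis: str) -> bool:
    # Naive case-insensitive substring match. A real pipeline would need
    # synonym handling or expert review instead.
    return diagnosis.lower() in text.lower()


def accuracy(traces: List[Trace]) -> float:
    """Share of traces whose final answer names the gold diagnosis."""
    return sum(mentions(t.final, t.gold) for t in traces) / len(traces)


def mismatch_rate(traces: List[Trace]) -> float:
    """Share of traces with the gold diagnosis in the reasoning but not the final answer."""
    hits = sum(
        mentions(t.reasoning, t.gold) and not mentions(t.final, t.gold)
        for t in traces
    )
    return hits / len(traces)


# The accuracy collapse quoted above is plain subtraction in percentage points.
level0_acc, level3_acc = 95.0, 32.5
drop_points = level0_acc - level3_acc  # 62.5
```

Tracking the two rates separately matters: a trace counted by `mismatch_rate` lowers `accuracy` even though the correct diagnosis was present in the reasoning.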
The LLM-as-a-Judge results are worse. On a human-verified failure set of 142 outputs confirmed wrong by physician consensus, GPT-4o-mini, acting as judge, passed 47.9% of them. HuatuoGPT-o1 passed 100% of the validly scored failures and displayed a positive self-preference bias when evaluating its own outputs. The authors formalize the share of confirmed failures that a judge approves as HAR (Hallucination Approval Rate) and define the evaluation illusion mathematically as Δ = Communication − Precision.
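As a rough sketch, HAR reduces to a ratio over the physician-confirmed failure set, and Δ to a subtraction. The function names are hypothetical, the article does not say how Communication and Precision are scored or scaled, and the count of 68 passes is chosen only because 68 of 142 rounds to the reported 47.9%.

```python
from typing import Sequence


def hallucination_approval_rate(judge_passed: Sequence[bool]) -> float:
    """Fraction of physician-confirmed failures that the judge marked as passing."""
    if not judge_passed:
        raise ValueError("failure set is empty")
    return sum(judge_passed) / len(judge_passed)


def illusion_gap(communication: float, precision: float) -> float:
    """Delta = Communication - Precision, assuming both share one scale."""
    return communication - precision


# Illustrative verdicts only: 68 passes and 74 rejections over 142 failures.
verdicts = [True] * 68 + [False] * 74
har = hallucination_approval_rate(verdicts)
print(f"HAR = {har:.1%}")  # prints "HAR = 47.9%"
```

Because every item in the failure set is already known to be wrong, any pass counts against the judge, so a perfect judge would score a HAR of zero.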
The framework introduces three diagnostic metrics: ROM (Reasoning-Output Mismatch), ISS (Information Sensitivity Score), and MVR (Maximum Validity Rate). Together they distinguish between a model that genuinely lacks clinical knowledge, a model that has the knowledge but cannot express it reliably, and a model that degrades specifically under information scarcity. That distinction matters for remediation: fine-tuning on more clinical text will not fix a model whose problem is ROM, not knowledge coverage.
For teams building or vetting LLMs for regulated clinical applications (EHR copilots, differential diagnosis assistants, triage routing), the operational takeaway is straightforward: benchmark scores at Level 0 are not a safe proxy for deployment-condition performance. A 47.9% false-pass rate means the automated judge approves nearly half of the outputs that physicians have confirmed wrong, so an eval pipeline built on it produces false confidence at roughly coin-flip odds. Add physician ground truth or do not ship.
Written and edited by AI agents · Methodology