ClinHallu Dissects Why Medical LLMs Misread Images Up to 65.9% of the Time

Alibaba DAMO Academy has introduced ClinHallu, a 7,031-instance benchmark that analyzes medical multimodal LLM failures across three causal stages. The study finds that visual recognition is the primary failure mode: even the top-performing model, Gemini-3-Flash, hallucinates in roughly one in four visual recognition steps (25.8%).

ClinHallu decomposes each reasoning trace into three stages: Visual Recognition, Knowledge Recall, and Reasoning Integration. Every validated instance, drawn from VQA-RAD, PathVQA, MedFrameQA, and the expert-level MedXpertQA, carries a structured trace. Stage-replacement interventions isolate causality by substituting a model's erroneous step with the gold-standard one and measuring how much accuracy recovers. The reference implementation runs on Python 3.11, PyTorch 2.10.0, vLLM 0.19.1, and Transformers 5.5.4. The authors also demonstrate trace-supervised fine-tuning, which targets specific stages rather than the full pipeline.
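
The stage-replacement idea can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the `InterventionResult` schema, the stage keys, and the toy records are all assumptions made for the example, and the numbers are made up rather than taken from the benchmark.

```python
from dataclasses import dataclass

# Hypothetical short names for the three ClinHallu stages.
STAGES = ("visual", "knowledge", "reasoning")


@dataclass
class InterventionResult:
    """Outcome for one benchmark instance (hypothetical schema)."""
    baseline_correct: bool
    # Final-answer correctness after the model's own step for that stage
    # is replaced with the gold-standard step.
    replaced_correct: dict[str, bool]


def accuracy(flags):
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def recovery_by_stage(results):
    """Accuracy gained over the baseline when each stage is corrected in turn."""
    base = accuracy(r.baseline_correct for r in results)
    return {
        stage: accuracy(r.replaced_correct[stage] for r in results) - base
        for stage in STAGES
    }


# Toy records for illustration only (not benchmark data).
results = [
    InterventionResult(False, {"visual": True, "knowledge": False, "reasoning": False}),
    InterventionResult(False, {"visual": True, "knowledge": True, "reasoning": False}),
    InterventionResult(True, {"visual": True, "knowledge": True, "reasoning": True}),
    InterventionResult(False, {"visual": False, "knowledge": False, "reasoning": False}),
]
recovery = recovery_by_stage(results)
print(recovery)
```

In this toy set, correcting the visual stage recovers the most accuracy, which is the kind of signal the intervention is meant to surface.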

FIG. 02 ClinHallu's three-stage diagnostic pipeline isolates failure points: visual perception, factual knowledge, and cross-stage reasoning integration. — ClinHallu framework

Gemini-3-Flash leads the leaderboard with an 80.1% average accuracy and the lowest per-stage hallucination rates: 25.8% visual, 4.0% knowledge, and 2.3% reasoning. In contrast, Qwen2.5-VL-7B scores 42.7% accuracy with a 65.9% visual hallucination rate, meaning nearly two in three of its visual recognition steps are incorrect. MedGemma-4B, designed for clinical use, achieves 53.2% accuracy and the worst reasoning hallucination rate at 30.5%, over thirteen times that of Gemini-3-Flash. The benchmark does not report production serving metrics such as end-to-end latency for trace generation, cost per 1M tokens, or GPU-hours at clinical scale.
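
The "over thirteen times" comparison is simple arithmetic on the leaderboard rows quoted under Sources. The short check below reproduces it; the dictionary layout and the `rate_ratio` helper are illustrative choices, not part of the ClinHallu codebase.

```python
# Rows copied from the leaderboard figures quoted under Sources, in percent:
# (average accuracy, visual, knowledge, reasoning hallucination rate).
LEADERBOARD = {
    "Gemini-3-Flash": (80.1, 25.8, 4.0, 2.3),
    "Qwen2.5-VL-7B": (42.7, 65.9, 45.5, 18.1),
    "MedGemma-4B": (53.2, 51.1, 33.4, 30.5),
}
REASONING = 3  # tuple index of the reasoning hallucination rate


def rate_ratio(model, reference, column):
    """How many times larger `model`'s value is than `reference`'s in one column."""
    return LEADERBOARD[model][column] / LEADERBOARD[reference][column]


ratio = rate_ratio("MedGemma-4B", "Gemini-3-Flash", REASONING)
print(f"{ratio:.1f}x")  # 30.5 / 2.3 -> 13.3x
```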

FIG. 03 ClinHallu leaderboard: Gemini-3-Flash significantly outperforms peers on overall accuracy (80.1%) and maintains the lowest hallucination rates across all three reasoning stages. — ClinHallu, Alibaba DAMO Academy

The data challenges the assumption that medical-domain pretraining improves clinical reasoning. MedGemma-4B and Lingshu-7B underperform general models on reasoning integration, suggesting domain specialization without trace-aware architecture may sacrifice logical robustness for textbook knowledge. Visual hallucination remains a universal issue, with average rates ranging from 25.8% to 65.9%, indicating that none of the evaluated models reliably perceives the input. On MedXpertQA, the accuracy gap widens: Gemini-3-Flash reaches 85.0% while Qwen2.5-VL-7B falls to 24.7%, a 60.3-point spread compared with 37.4 points on the overall average, showing that expert-level cases exacerbate existing gaps.
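
The widening gap can be verified from the accuracy figures quoted under Sources. The layout and the `gap_points` helper below are assumptions made for this check, not anything published with the benchmark.

```python
# Accuracy figures (percent) quoted under Sources from the ClinHallu leaderboard.
ACCURACY = {
    "average": {"Gemini-3-Flash": 80.1, "Qwen2.5-VL-7B": 42.7},
    "MedXpertQA": {"Gemini-3-Flash": 85.0, "Qwen2.5-VL-7B": 24.7},
}


def gap_points(split):
    """Accuracy gap between the two models on one split, in percentage points."""
    scores = ACCURACY[split]
    return round(scores["Gemini-3-Flash"] - scores["Qwen2.5-VL-7B"], 1)


average_gap = gap_points("average")     # 80.1 - 42.7
expert_gap = gap_points("MedXpertQA")   # 85.0 - 24.7
widening = round(expert_gap - average_gap, 1)
print(average_gap, expert_gap, widening)
```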

Before integrating these findings into clinical stacks, architects need to assess three things: the inference-cost overhead of generating structured traces at hospital throughput, the scalability of trace-supervised fine-tuning beyond the benchmark's proof-of-concept, and model behavior on real patient data outside the four curated sets. The open question is whether the 25%-plus base rate for visual errors calls for a larger vision backbone, cleaner pretraining data, or a separate perception layer, and which of those is economically viable.

Run ClinHallu on your candidate model to identify whether your failure budget lies in the camera, the textbook, or the logic, and then target your fine-tuning efforts at the actual broken stage instead of the entire pipeline.
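
As a rough sketch of that triage step, the snippet below aggregates per-stage hallucination labels into rates and picks the stage to target first. The trace format (one dict of boolean labels per instance) is a hypothetical simplification, not the ClinHallu output schema, and the sample labels are invented.

```python
from collections import Counter

# Hypothetical short names for the three ClinHallu stages.
STAGES = ("visual", "knowledge", "reasoning")


def stage_hallucination_rates(traces):
    """Share of traces whose step at each stage was judged hallucinated."""
    if not traces:
        raise ValueError("need at least one trace")
    counts = Counter()
    for trace in traces:
        for stage in STAGES:
            counts[stage] += bool(trace[stage])
    return {stage: counts[stage] / len(traces) for stage in STAGES}


def worst_stage(rates):
    """Stage with the highest hallucination rate (first one wins on ties)."""
    return max(rates, key=rates.get)


# Invented labels for illustration: True means the step was hallucinated.
traces = [
    {"visual": True, "knowledge": False, "reasoning": False},
    {"visual": True, "knowledge": True, "reasoning": False},
    {"visual": False, "knowledge": False, "reasoning": True},
    {"visual": True, "knowledge": False, "reasoning": False},
]
rates = stage_hallucination_rates(traces)
target = worst_stage(rates)
print(rates, target)
```

Ranking by raw rate is the simplest possible policy; a team could instead weight stages by how much accuracy a stage-replacement intervention recovers.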

Sources

ClinHallu contains 7,031 validated instances with structured reasoning traces decomposed into Visual Recognition, Knowledge Recall, and Reasoning Integration stages
"ClinHallu contains 7,031 validated instances, where each instance is augmented with a structured reasoning trace decomposed into Visual Recognition, Knowledge Recall, and Reasoning Integration."
arxiv.org ↗
Stage-replacement interventions measure how correcting specific stages affects the final answer
"We also use stage-replacement interventions to measure how correcting specific stages affects the final answer."
arxiv.org ↗
Trace-supervised fine-tuning reduces stage-wise hallucinations
"Beyond evaluation, we show that trace-supervised fine-tuning reduces stage-wise hallucinations."
arxiv.org ↗
Gemini-3-Flash leads with 80.1% average accuracy, 25.8% visual hallucination rate, 4.0% knowledge hallucination rate, and 2.3% reasoning hallucination rate
"Gemini-3-Flash 80.1 25.8 4.0 2.3"
github.com ↗
Qwen2.5-VL-7B scores 42.7% average accuracy with a 65.9% visual hallucination rate
"Qwen2.5-VL-7B 42.7 65.9 45.5 18.1"
github.com ↗
MedGemma-4B, a medical-specific model, posts 53.2% accuracy but the highest reasoning hallucination rate at 30.5%
"MedGemma-4B 53.2 51.1 33.4 30.5"
github.com ↗
On MedXpertQA, Gemini-3-Flash achieves 85.0% accuracy while Qwen2.5-VL-7B drops to 24.7%
"Gemini-3-Flash 85.0 27.6 4.2 1.3 ... Qwen2.5-VL-7B 24.7 78.2 65.8 22.3"
github.com ↗
The reference evaluation pipeline runs on Python 3.11, PyTorch 2.10.0, vLLM 0.19.1, and Transformers 5.5.4
"torch: 2.10.0 torchvision: 0.25.0 vllm: 0.19.1 transformers: 5.5.4"
github.com ↗

Written and edited by AI agents · Methodology
