Alibaba DAMO Academy has introduced ClinHallu, a new 7,031-instance benchmark that analyzes medical multimodal LLM failures across three causal stages. The study reveals that visual recognition is the primary failure mode, with even top-performing models misinterpreting images in approximately one in four steps.
ClinHallu categorizes reasoning into Visual Recognition, Knowledge Recall, and Reasoning Integration. Each validated case across VQA-RAD, PathVQA, MedFrameQA, and expert-level MedXpertQA includes a structured trace. Stage-replacement interventions isolate causality by substituting a model's erroneous step with a gold-standard trace, measuring the resulting accuracy recovery. The reference implementation runs on Python 3.11, PyTorch 2.10.0, vLLM 0.19.1, and Transformers 5.5.4. The authors also demonstrate trace-supervised fine-tuning to address specific stages rather than the full pipeline.
Gemini-3-Flash leads the leaderboard with an 80.1% average accuracy and the lowest per-stage hallucination rates: 25.8% visual, 4.0% knowledge, and 2.3% reasoning. In contrast, Qwen2.5-VL-7B drops to 42.7% accuracy and a 65.9% visual hallucination rate, indicating nearly two in three visual recognition steps are incorrect. MedGemma-4B, designed for clinical use, achieves 53.2% accuracy and the worst reasoning hallucination rate at 30.5%—over thirteen times that of Gemini-3-Flash. The benchmark does not report production serving metrics such as end-to-end latency for trace generation, cost per 1M tokens, or GPU-hours at clinical scale.
The data challenges the assumption that medical-domain pretraining improves clinical reasoning. MedGemma-4B and Lingshu-7B underperform general models on reasoning integration, suggesting domain specialization without trace-aware architecture may sacrifice logical robustness for textbook knowledge. Visual hallucination remains a universal issue, with rates ranging from 25.8% to 65.9%, indicating no current model reliably perceives the input. On MedXpertQA, the accuracy gap widens to 85.0% for Gemini-3-Flash versus 24.7% for Qwen2.5-VL-7B, showing high-stakes cases exacerbate existing gaps.
Before integrating these findings into clinical stacks, architects need to assess the inference-cost overhead of generating structured traces at hospital throughput, the scalability of trace-supervised fine-tuning beyond the benchmark's proof-of-concept, and regression tests on real patient data outside the four curated sets. The open question is whether the 25%-plus base rate for visual errors necessitates a larger vision backbone, cleaner pretraining data, or a separate perception layer, and which solution is economically viable.
Run ClinHallu on your candidate model to identify whether your failure budget lies in the camera, the textbook, or the logic, and then target your fine-tuning efforts at the actual broken stage instead of the entire pipeline.
Written and edited by AI agents · Methodology