Scaling model size does not reduce clinical errors when it matters most. A benchmark study of 34 locally deployed large language models across six clinical configurations finds that aggregate accuracy can climb sharply while high-risk error rates stay dangerously elevated.

The research introduces SaFE-Scale, a framework tracking clinical LLM safety as model size, retrieval complexity, context window, and compute increase. The team built RadSaFE-200: 200 radiology questions annotated by clinicians with labels for high-risk error, unsafe answer, and evidence contradiction. Questions span two evidence conditions — clean and conflicting — across six deployment configurations: zero-shot closed-book, clean-evidence, conflict-evidence, standard RAG, agentic RAG, and max-context prompting.
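
To make the benchmark's structure concrete, here is a minimal sketch of how one graded response might be represented, assuming the clinician labels are applied per model answer. The class and field names (`GradedResponse`, `SafetyLabels`, the enum values) are illustrative placeholders, not the authors' published schema.

```python
from dataclasses import dataclass
from enum import Enum


class EvidenceCondition(Enum):
    CLEAN = "clean"
    CONFLICTING = "conflicting"


class DeploymentConfig(Enum):
    ZERO_SHOT_CLOSED_BOOK = "zero_shot_closed_book"
    CLEAN_EVIDENCE = "clean_evidence"
    CONFLICT_EVIDENCE = "conflict_evidence"
    STANDARD_RAG = "standard_rag"
    AGENTIC_RAG = "agentic_rag"
    MAX_CONTEXT_PROMPTING = "max_context_prompting"


@dataclass
class SafetyLabels:
    # Clinician-assigned labels on a single model answer (assumed granularity).
    high_risk_error: bool
    unsafe_answer: bool
    evidence_contradiction: bool


@dataclass
class GradedResponse:
    question_id: str
    evidence_condition: EvidenceCondition
    config: DeploymentConfig
    answer: str
    labels: SafetyLabels
```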

Evidence quality drives safety. Providing clean, curated evidence lifted mean accuracy from 73.5% to 94.1% while cutting high-risk errors from 12.0% to 2.6%, evidence contradictions from 12.7% to 2.3%, and dangerous overconfidence from 8.0% to 1.6%. Standard RAG and agentic RAG failed to reproduce that safety profile. Agentic RAG outpaced standard RAG on accuracy and reduced contradiction rates, but high-risk error and dangerous overconfidence remained elevated. Max-context prompting added latency without closing the safety gap. Additional inference-time compute produced only marginal gains.
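
Restating those figures as relative changes makes the gap clearer. The short script below only re-computes the numbers quoted above, on the assumption that each "from" value is the closed-book baseline and each "to" value is the clean-evidence condition.

```python
# Reported figures from the study (baseline vs. clean-evidence condition);
# this only restates them and computes the relative change.
metrics = {
    "accuracy (%)":                 (73.5, 94.1),
    "high-risk error (%)":          (12.0, 2.6),
    "evidence contradiction (%)":   (12.7, 2.3),
    "dangerous overconfidence (%)": (8.0, 1.6),
}

for name, (baseline, clean_evidence) in metrics.items():
    relative_change = (clean_evidence - baseline) / baseline * 100
    print(f"{name:30s} {baseline:5.1f} -> {clean_evidence:5.1f}  ({relative_change:+.0f}% relative)")
```

Run as written, this shows roughly a 28% relative gain in accuracy alongside roughly 78–82% relative reductions in each error category.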

FIG. 02 Evidence quality impact on clinical LLM safety: clean, curated evidence reduced all four error categories while lifting overall accuracy from 73.5% to 94.1%. — SaFE-Scale study, RadSaFE-200

For enterprise healthcare teams, this overturns a common procurement assumption: that higher accuracy or a larger parameter count makes a model inherently safer. The study's worst-case analysis found that clinically consequential errors concentrate in a small subset of questions, so aggregate benchmark scores can mask the localized failure modes that carry the most clinical weight. A model scoring in the 90th percentile overall can still generate confident, evidence-contradicting answers on exactly the cases where it matters most.
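
A toy calculation, using entirely made-up numbers rather than study data, shows how that masking works: a model that is 97% accurate on routine items but only 60% accurate on a small high-stakes subset still posts an aggregate score above 90%.

```python
# Synthetic illustration only (not study data): aggregate accuracy can hide
# errors that concentrate on a small, high-stakes subset of questions.
import random

random.seed(0)
N = 200
high_stakes = set(random.sample(range(N), 20))  # assume 10% of items carry most clinical risk

# Hypothetical model: 97% correct on routine items, 60% correct on high-stakes ones.
correct = [
    (random.random() < 0.60) if i in high_stakes else (random.random() < 0.97)
    for i in range(N)
]

overall = sum(correct) / N
worst_case = sum(correct[i] for i in high_stakes) / len(high_stakes)
print(f"aggregate accuracy: {overall:.1%}")    # looks strong
print(f"high-stakes subset: {worst_case:.1%}")  # where the clinical risk actually sits
```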

The findings complicate the current wave of agentic clinical AI architectures. Agentic RAG — where models autonomously retrieve and synthesize external evidence — is widely positioned as a path to higher accuracy and broader context coverage. In this study it resolved some failure categories while leaving others intact, notably dangerous overconfidence and high-risk error rates. Any agentic pipeline headed toward clinical deployment needs safety-specific evaluation, not just accuracy benchmarks, before sign-off.
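
One way to operationalize that sign-off requirement is a gate that checks each safety metric against its own threshold instead of relying on a single accuracy score. The metric names below mirror the study's error categories, but the threshold values and the `passes_safety_gate` function are placeholders a governance team would have to define, not figures from the paper.

```python
# Sketch of a safety-specific evaluation gate for a clinical AI pipeline.
from typing import Dict

THRESHOLDS: Dict[str, float] = {
    "high_risk_error_rate": 0.03,
    "unsafe_answer_rate": 0.03,
    "evidence_contradiction_rate": 0.03,
    "dangerous_overconfidence_rate": 0.02,
}

def passes_safety_gate(metrics: Dict[str, float], min_accuracy: float = 0.90) -> bool:
    """Require accuracy AND every safety metric to clear its threshold."""
    if metrics.get("accuracy", 0.0) < min_accuracy:
        return False
    return all(metrics.get(name, 1.0) <= limit for name, limit in THRESHOLDS.items())

# Strong accuracy alone does not clear the gate.
print(passes_safety_gate({
    "accuracy": 0.94,
    "high_risk_error_rate": 0.08,       # fails its threshold despite high accuracy
    "unsafe_answer_rate": 0.02,
    "evidence_contradiction_rate": 0.02,
    "dangerous_overconfidence_rate": 0.01,
}))  # -> False
```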

RadSaFE-200 covers radiology only. All 34 models were run locally, so API-hosted frontier models fall outside this study's scope. Generalizing the framework to other specialties or proprietary model families requires additional benchmarking under the same safety-labeled conditions.

Clinical LLM safety is a deployment property, shaped by evidence quality, retrieval design, and context construction — not a passive consequence of scale. Healthcare organizations seeking regulatory or internal deployment sign-off on clinical AI should treat safety evaluation as a distinct workstream from accuracy benchmarking. Worst-case failure-concentration analysis should be built into the criteria from the start.

Written and edited by AI agents