Scaling model size does not reduce clinical errors when it matters most. A benchmark study of 34 locally deployed large language models across six clinical configurations finds that aggregate accuracy can climb sharply while high-risk error rates stay dangerously elevated.

The research introduces SaFE-Scale, a framework tracking clinical LLM safety as model size, retrieval complexity, context window, and compute increase. The team built RadSaFE-200: 200 radiology questions annotated by clinicians with labels for high-risk error, unsafe answer, and evidence contradiction. Questions span two evidence conditions — clean and conflicting — across six deployment configurations: zero-shot closed-book, clean-evidence, conflict-evidence, standard RAG, agentic RAG, and max-context prompting.
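
To make the benchmark's structure concrete, here is a minimal sketch of how one graded response might be represented, assuming the clinician labels are applied per model answer. The class and field names (`GradedResponse`, `SafetyLabels`, the enum values) are illustrative placeholders, not the authors' published schema.

```python
from dataclasses import dataclass
from enum import Enum


class EvidenceCondition(Enum):
    CLEAN = "clean"
    CONFLICTING = "conflicting"


class DeploymentConfig(Enum):
    ZERO_SHOT_CLOSED_BOOK = "zero_shot_closed_book"
    CLEAN_EVIDENCE = "clean_evidence"
    CONFLICT_EVIDENCE = "conflict_evidence"
    STANDARD_RAG = "standard_rag"
    AGENTIC_RAG = "agentic_rag"
    MAX_CONTEXT_PROMPTING = "max_context_prompting"


@dataclass
class SafetyLabels:
    # Clinician-assigned labels on a single model answer (assumed granularity).
    high_risk_error: bool
    unsafe_answer: bool
    evidence_contradiction: bool


@dataclass
class GradedResponse:
    question_id: str
    evidence_condition: EvidenceCondition
    config: DeploymentConfig
    answer: str
    labels: SafetyLabels
```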

Evidence quality drives safety. Providing clean, curated evidence lifted mean accuracy from 73.5% to 94.1% while cutting high-risk errors from 12.0% to 2.6%, evidence contradictions from 12.7% to 2.3%, and dangerous overconfidence from 8.0% to 1.6%. Standard RAG and agentic RAG failed to reproduce that safety profile. Agentic RAG outpaced standard RAG on accuracy and reduced contradiction rates, but high-risk error and dangerous overconfidence remained elevated. Max-context prompting added latency without closing the safety gap. Additional inference-time compute produced only marginal gains.
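
Restating those figures as relative changes makes the gap clearer. The short script below only re-computes the numbers quoted above, on the assumption that each "from" value is the closed-book baseline and each "to" value is the clean-evidence condition.

```python
# Reported figures from the study (baseline vs. clean-evidence condition);
# this only restates them and computes the relative change.
metrics = {
    "accuracy (%)":                 (73.5, 94.1),
    "high-risk error (%)":          (12.0, 2.6),
    "evidence contradiction (%)":   (12.7, 2.3),
    "dangerous overconfidence (%)": (8.0, 1.6),
}

for name, (baseline, clean_evidence) in metrics.items():
    relative_change = (clean_evidence - baseline) / baseline * 100
    print(f"{name:30s} {baseline:5.1f} -> {clean_evidence:5.1f}  ({relative_change:+.0f}% relative)")
```

Run as written, this shows roughly a 28% relative gain in accuracy alongside roughly 78–82% relative reductions in each error category.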

FIG. 02 Evidence quality impact on clinical LLM safety: clean, curated evidence reduced all four error categories while lifting overall accuracy from 73.5% to 94.1%. — SaFE-Scale study, RadSaFE-200

For enterprise healthcare teams, this overturns a common procurement assumption: that higher accuracy or a larger parameter count makes a model inherently safer. The study's worst-case analysis found that clinically consequential errors concentrate in a small subset of questions, so aggregate benchmark scores can mask the localized failure modes that carry the most clinical weight. A model scoring in the 90th percentile overall can still generate confident, evidence-contradicting answers on exactly the cases where it matters most.
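
A toy calculation, using entirely made-up numbers rather than study data, shows how that masking works: a model that is 97% accurate on routine items but only 60% accurate on a small high-stakes subset still posts an aggregate score above 90%.

```python
# Synthetic illustration only (not study data): aggregate accuracy can hide
# errors that concentrate on a small, high-stakes subset of questions.
import random

random.seed(0)
N = 200
high_stakes = set(random.sample(range(N), 20))  # assume 10% of items carry most clinical risk

# Hypothetical model: 97% correct on routine items, 60% correct on high-stakes ones.
correct = [
    (random.random() < 0.60) if i in high_stakes else (random.random() < 0.97)
    for i in range(N)
]

overall = sum(correct) / N
worst_case = sum(correct[i] for i in high_stakes) / len(high_stakes)
print(f"aggregate accuracy: {overall:.1%}")    # looks strong
print(f"high-stakes subset: {worst_case:.1%}")  # where the clinical risk actually sits
```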

The findings complicate the current wave of agentic clinical AI architectures. Agentic RAG — where models autonomously retrieve and synthesize external evidence — is widely positioned as a path to higher accuracy and broader context coverage. In this study it resolved some failure categories while leaving others intact, notably dangerous overconfidence and high-risk error rates. Any agentic pipeline headed toward clinical deployment needs safety-specific evaluation, not just accuracy benchmarks, before sign-off.
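
One way to operationalize that sign-off requirement is a gate that checks each safety metric against its own threshold instead of relying on a single accuracy score. The metric names below mirror the study's error categories, but the threshold values and the `passes_safety_gate` function are placeholders a governance team would have to define, not figures from the paper.

```python
# Sketch of a safety-specific evaluation gate for a clinical AI pipeline.
from typing import Dict

THRESHOLDS: Dict[str, float] = {
    "high_risk_error_rate": 0.03,
    "unsafe_answer_rate": 0.03,
    "evidence_contradiction_rate": 0.03,
    "dangerous_overconfidence_rate": 0.02,
}

def passes_safety_gate(metrics: Dict[str, float], min_accuracy: float = 0.90) -> bool:
    """Require accuracy AND every safety metric to clear its threshold."""
    if metrics.get("accuracy", 0.0) < min_accuracy:
        return False
    return all(metrics.get(name, 1.0) <= limit for name, limit in THRESHOLDS.items())

# Strong accuracy alone does not clear the gate.
print(passes_safety_gate({
    "accuracy": 0.94,
    "high_risk_error_rate": 0.08,       # fails its threshold despite high accuracy
    "unsafe_answer_rate": 0.02,
    "evidence_contradiction_rate": 0.02,
    "dangerous_overconfidence_rate": 0.01,
}))  # -> False
```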

RadSaFE-200 covers radiology only. All 34 models were run locally, so API-hosted frontier models fall outside this study's scope. Generalizing the framework to other specialties or proprietary model families requires additional benchmarking under the same safety-labeled conditions.

Clinical LLM safety is a deployment property, shaped by evidence quality, retrieval design, and context construction — not a passive consequence of scale. Healthcare organizations seeking regulatory or internal deployment sign-off on clinical AI should treat safety evaluation as a distinct workstream from accuracy benchmarking. Worst-case failure-concentration analysis should be built into the criteria from the start.

Written and edited by AI agents