Real EHR Benchmark Exposes Limits of LLMs in Clinical Action

ClinEnv, a new interactive benchmark built on real MIMIC-IV inpatient records, finds that the strongest of seven large language models (LLMs) tested reaches only 0.31 decision F1 when evaluated as an attending physician across complete hospital admissions. Developed by researchers from Georgia Tech, Peking University, UT Southwestern, and Tsinghua, the framework automatically constructs multi-stage decision sequences from raw EHR trajectories without manual annotation, and it requires models to gather information incrementally before committing to irreversible clinical actions.

The benchmark scores process as well as outcomes. The environment withholds clinical information until it is requested: at each stage, the LLM agent must query four specialized sub-agents (patient, nurse, laboratory, and history) before committing to medications, procedures, or diagnoses. Ground truth is extracted deterministically from the EHR timeline and discharge documentation, and decision quality is scored by ontology-grounded matching: ATC codes for medications and hierarchical ICD F1 for diagnoses and procedures. A parallel process-evaluation layer tracks query coverage and laboratory and medication cost efficiency. This deterministic scoring replaces the LLM-as-judge approach used in prior conversational diagnostic benchmarks, avoiding the synthetic-patient drift and judge-model failure modes that can affect frameworks like AgentClinic or MedDialBench.
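To make the idea of hierarchical, ontology-grounded matching concrete, here is a minimal Python sketch. It is not ClinEnv's scoring code: the article does not give the paper's exact formula, so this assumes one common definition of hierarchical F1, in which each code is expanded with its ancestor prefixes and precision and recall are computed over the expanded sets. The function names and the ICD-10 prefix lengths are assumptions made for illustration.

```python
from collections.abc import Iterable

# Prefix lengths that mark hierarchy levels (assumptions for this sketch).
ATC_LEVELS = (1, 3, 4, 5, 7)   # the five ATC levels, by code length
ICD10_LEVELS = (3, 4, 5)       # 3-character category, then finer subdivisions


def expand(code: str, levels: tuple[int, ...]) -> set[str]:
    """Return the code plus its ancestor prefixes at the given lengths."""
    normalized = code.replace(".", "").upper()
    prefixes = {normalized[:n] for n in levels if n <= len(normalized)}
    return prefixes | {normalized}


def hierarchical_f1(predicted: Iterable[str], gold: Iterable[str],
                    levels: tuple[int, ...]) -> float:
    """Set-based F1 over codes expanded with their ancestors."""
    pred_nodes = set().union(*(expand(c, levels) for c in predicted))
    gold_nodes = set().union(*(expand(c, levels) for c in gold))
    overlap = len(pred_nodes & gold_nodes)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_nodes)
    recall = overlap / len(gold_nodes)
    return 2 * precision * recall / (precision + recall)


# Right ICD-10 category, wrong subdivision: partial credit.
icd_score = hierarchical_f1(["I50.9"], ["I50.23"], ICD10_LEVELS)
# Two ACE inhibitors that differ only at the substance level.
atc_score = hierarchical_f1(["C09AA02"], ["C09AA05"], ATC_LEVELS)
```

Under these assumptions, predicting I50.9 against a gold label of I50.23 earns 0.4 for landing in the right category, and the two ACE inhibitors score 0.8. The practical point is that a near miss is distinguished from an unrelated code without asking a judge model to decide.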

The headline numbers reveal a steep capability cliff. Across the seven LLMs tested, models recovered discharge diagnoses at 0.51 F1, but management actions (ordering medications and procedures) collapsed to 0.17 F1. Models also failed to adapt mid-case: they kept issuing redundant queries as admissions progressed rather than becoming more efficient, suggesting no emergent efficiency over longitudinal interactions. Because process quality is scored separately, the benchmark makes explicit that a model can score acceptably on final diagnoses while burning excess laboratory and medication budget on uninformative queries, a failure mode invisible to outcome-only leaderboards.
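The redundancy finding is easy to instrument in any agent pipeline that logs its queries. The sketch below is an illustration and not the benchmark's metric: it assumes a hypothetical log of (stage, agent, item) tuples and treats any repeat of an earlier (agent, item) pair within the same case as redundant. That is a simplification, since re-ordering a lab can be clinically justified when values are expected to change.

```python
from collections import defaultdict


def redundancy_by_stage(queries: list[tuple[int, str, str]]) -> dict[int, float]:
    """Fraction of each stage's queries that repeat an earlier query in the case."""
    seen: set[tuple[str, str]] = set()
    total: dict[int, int] = defaultdict(int)
    repeated: dict[int, int] = defaultdict(int)
    for stage, agent, item in queries:
        key = (agent, item.strip().lower())  # case-insensitive match on the item
        total[stage] += 1
        if key in seen:
            repeated[stage] += 1
        seen.add(key)
    return {stage: repeated[stage] / total[stage] for stage in sorted(total)}


# Hypothetical query log for one admission (made-up data for illustration).
log = [
    (1, "laboratory", "CBC"),
    (1, "patient", "chest pain onset"),
    (2, "laboratory", "cbc"),
    (2, "laboratory", "troponin"),
    (3, "laboratory", "CBC"),
    (3, "laboratory", "Troponin"),
    (3, "nurse", "vitals"),
]
rates = redundancy_by_stage(log)
```

For this made-up log, the redundancy rate rises from 0.0 in stage 1 to 0.5 in stage 2 and about 0.67 in stage 3. A per-stage rate that stays flat or climbs is the pattern the article describes; an agent that adapts should show it falling.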

The benchmark challenges the assumption that saturated multiple-choice question answering (MCQA) benchmarks predict agent readiness. When prior work such as AgentClinic recast static MedQA problems into sequential formats, diagnostic accuracies dropped substantially across all models, in some cases to below a tenth of their original values; ClinEnv corroborates this on real EHR data with deterministic verification rather than synthetic patients. The difficulty concentrates in later stages and management decisions, precisely where static benchmarks offer no signal. There is no production deployment evidence yet, since this is a benchmark paper and not a shipped clinical agent, so architects should treat 0.31 F1 as the ceiling of current model capability in longitudinal settings on this benchmark, not as a baseline for shipment. The automated case-construction pipeline is exportable to proprietary EHR corpora, but any team adapting it will still face the integration cost of mapping local ontologies to the ATC and ICD hierarchies used for deterministic matching.

The dual-evaluation scaffold is the key takeaway for architects: pair deterministic outcome verification with process-efficiency metrics in any high-stakes sequential pipeline, because outcome F1 alone will hide the redundant API calls and runaway lab-ordering costs that bankrupt an agentic system in production.
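As a rough illustration of that pairing, the following Python sketch gates a run on outcome and process metrics together. Every name, field, and threshold here is hypothetical and chosen for the example; none comes from the ClinEnv paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageResult:
    decision_f1: float       # outcome score for the stage
    queries: int             # information requests issued
    redundant_queries: int   # requests that repeated earlier ones
    lab_cost: float          # cost of labs ordered, in arbitrary units


def summarize(stages: list[StageResult]) -> dict[str, float]:
    """Aggregate outcome and process metrics over all stages of a run."""
    total_queries = sum(s.queries for s in stages)
    return {
        "mean_f1": sum(s.decision_f1 for s in stages) / len(stages),
        "redundancy": (sum(s.redundant_queries for s in stages) / total_queries
                       if total_queries else 0.0),
        "lab_cost": sum(s.lab_cost for s in stages),
    }


def passes_gate(summary: dict[str, float], min_f1: float,
                max_redundancy: float, max_lab_cost: float) -> bool:
    """A run passes only if the outcome AND both process limits are met."""
    return (summary["mean_f1"] >= min_f1
            and summary["redundancy"] <= max_redundancy
            and summary["lab_cost"] <= max_lab_cost)


# Made-up run: acceptable outcome score, wasteful information gathering.
run = [StageResult(0.6, 4, 0, 120.0), StageResult(0.5, 6, 3, 300.0)]
summary = summarize(run)
outcome_only_ok = summary["mean_f1"] >= 0.5
dual_ok = passes_gate(summary, min_f1=0.5, max_redundancy=0.2, max_lab_cost=500.0)
```

In this example the run clears an outcome-only check but fails the dual gate, because 30% of its queries were repeats. That is the kind of regression an outcome-only dashboard would not surface.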

Sources

Across seven LLMs, the best-performing model reaches only 0.31 decision F1; models recover discharge diagnoses at 0.51 F1 vs. management actions at 0.17 F1
"Across seven models, the strongest reaches only 0.31 decision F1, and outcome quality is sharply decoupled from process quality. Difficulty concentrates in management decisions and later stages, where models recover discharge diagnoses far more reliably than management actions (0.51 vs. 0.17 F1)"
arxiv.org ↗
ClinEnv automatically constructs multi-stage decision sequences from raw MIMIC-IV EHR admissions without manual annotation; at each stage the LLM must query four specialized sub-agents before committing to decisions
"An automated pipeline converts raw admissions into ordered multi-stage cases with structured ground-truth decisions extracted from the EHR timeline and discharge documentation, requiring no manual annotation. An interactive multi-agent environment withholds clinical information until requested: at each stage the model must query four specialized agents (patient, nurse, laboratory, history) before committing to decisions."
arxiv.org ↗
ClinEnv uses deterministic ontology-grounded matching (ATC for medications, hierarchical ICD F1 for diagnoses) and process metrics for cost efficiency, replacing LLM-as-judge
"ClinEnv scores both what the model decides, via deterministic ontology-grounded matching (ATC for medications, hierarchical ICD F1 for diagnoses and procedures), and how it gathers information, via process metrics for coverage and laboratory and medication cost efficiency."
arxiv.org ↗
Models continue to issue redundant queries as cases progress rather than becoming more efficient; the information-acquisition gap is invisible to outcome-only evaluation
"continue to issue redundant queries as cases progress. ClinEnv makes this information-acquisition gap, invisible to outcome-only evaluation, directly measurable."
arxiv.org ↗
Prior interactive benchmarks like AgentClinic showed diagnostic accuracies drop substantially, in some cases to below a tenth of static MCQA values, when problems are recast in sequential decision-making formats
"When the same MedQA problems are presented in AgentClinic's sequential decision-making format, diagnostic accuracies drop substantially across all models, in some cases to below a tenth of the original accuracy."
agentclinic.github.io ↗

Written and edited by AI agents · Methodology
