ClinEnv, a new interactive benchmark built on real MIMIC-IV inpatient records, shows that the best-performing large language models (LLMs) reach a decision F1 of only 0.31 when evaluated as attending physicians across complete hospital admissions. Developed by researchers from Georgia Tech, Peking University, UT Southwestern, and Tsinghua, the framework automatically constructs multi-stage decision sequences from raw EHR trajectories without manual annotation, and it requires models to gather information incrementally before committing to irreversible clinical actions.

The benchmark measures process as well as outcomes. At each stage, the LLM agent must query four specialized sub-agents (patient, nurse, laboratory, and history) before issuing medications, procedures, or diagnoses. Ground truth is extracted deterministically from the EHR timeline and discharge documentation, and decision quality is scored via ontology-grounded matching: ATC codes for medications and hierarchical ICD F1 for diagnoses and procedures. A parallel process-evaluation layer tracks query coverage and the cost efficiency of laboratory and medication orders. This deterministic scoring replaces the LLM-as-judge approach used in prior conversational diagnostic benchmarks, avoiding the synthetic-patient drift and judge-model failure modes that affect frameworks like AgentClinic or MedDialBench.
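To make the idea of hierarchical ICD matching concrete, here is a minimal Python sketch. The source does not give ClinEnv's exact formula, so the partial-credit scheme (1.0 for an exact code, 0.5 for a shared three-character category) and the function names `normalize`, `match_score`, and `hierarchical_f1` are illustrative assumptions, not the paper's implementation.

```python
from typing import Iterable


def normalize(code: str) -> str:
    """Strip the dot and whitespace so 'I50.9' and 'i509' compare equal."""
    return code.replace(".", "").strip().upper()


def match_score(pred: str, gold: str) -> float:
    """Assumed partial credit: 1.0 for exact, 0.5 for the same 3-character category."""
    if pred == gold:
        return 1.0
    if len(pred) >= 3 and pred[:3] == gold[:3]:
        return 0.5
    return 0.0


def hierarchical_f1(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """Soft F1: each code is credited with its best match on the other side."""
    pred_set = {normalize(c) for c in predicted}
    gold_set = {normalize(c) for c in gold}
    if not pred_set or not gold_set:
        return 0.0
    precision = sum(max(match_score(p, g) for g in gold_set) for p in pred_set) / len(pred_set)
    recall = sum(max(match_score(p, g) for p in pred_set) for g in gold_set) / len(gold_set)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Two predicted diagnoses against three documented discharge diagnoses.
score = hierarchical_f1(["I50.9", "E11.9"], ["I50.21", "E11.9", "N17.9"])
print(round(score, 2))  # 0.6
```

In this example the model gets half credit for naming heart failure at the wrong specificity, full credit for the diabetes code, and no credit for the missed kidney injury, which is the kind of graded signal a judge-free scorer can produce deterministically.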

The results reveal a steep capability cliff. Across the seven LLMs tested, discharge diagnosis recovery reached 0.51 F1, but management actions (ordering medications and procedures) collapsed to 0.17 F1. Models also failed to adapt mid-case: redundant queries increased rather than decreased as admissions progressed, suggesting that no efficiency behavior emerges during longitudinal interactions. Because process quality is scored separately, the benchmark makes explicit that a model can score acceptably on final diagnoses while spending excess laboratory and medication budget on uninformative orders, a failure mode invisible to outcome-only leaderboards.
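The redundancy trend described above can be tracked with very little code. The sketch below assumes a hypothetical query log of `(stage, sub_agent, item)` tuples and treats any repeat of an earlier `(sub_agent, item)` pair within the same admission as redundant. ClinEnv's own definition of redundancy is not given in the source and may differ, for example by allowing repeat labs whose values have changed.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Query = Tuple[int, str, str]  # (stage index, sub-agent name, requested item)


def redundancy_by_stage(query_log: Iterable[Query]) -> Dict[int, float]:
    """Return, per stage, the share of queries that repeat an earlier request."""
    seen = set()
    totals = defaultdict(int)
    repeats = defaultdict(int)
    for stage, agent, item in query_log:
        key = (agent, item)
        totals[stage] += 1
        if key in seen:  # already asked earlier in this admission
            repeats[stage] += 1
        seen.add(key)
    return {stage: repeats[stage] / totals[stage] for stage in sorted(totals)}


# Hypothetical three-stage admission.
log = [
    (1, "laboratory", "creatinine"),
    (1, "nurse", "vitals"),
    (2, "laboratory", "creatinine"),
    (2, "patient", "pain"),
    (3, "laboratory", "creatinine"),
    (3, "nurse", "vitals"),
]
print(redundancy_by_stage(log))  # {1: 0.0, 2: 0.5, 3: 1.0}
```

A rate that climbs from stage to stage, as in this toy log, is the pattern the benchmark reports: the agent keeps paying for information it already holds.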

The benchmark challenges the assumption that saturated MCQA benchmarks predict agent readiness. When prior work such as AgentClinic recast static MedQA problems into sequential formats, diagnostic accuracies dropped to below a tenth of their original values; ClinEnv corroborates this on real EHR data with deterministic verification rather than synthetic patients. The difficulty concentrates in later stages and management decisions, precisely where static benchmarks offer no signal. There is no production deployment evidence yet (this is a benchmark paper, not a shipped clinical agent), so architects should treat the 0.31 F1 ceiling as an upper bound on current model capability in longitudinal settings, not a baseline for shipment. The automated case-construction pipeline is portable to proprietary EHR corpora, but any team adapting it will still face the integration cost of mapping local ontologies to the ATC and ICD hierarchies used for deterministic matching.
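One way to size that integration cost early is to measure how much of a local formulary already maps onto ATC. The following is a rough sketch under stated assumptions: the local identifiers (`RX-0001` and so on) and the `map_local_codes` helper are hypothetical, and a real crosswalk would come from the institution's own terminology service rather than a hand-written dictionary.

```python
from typing import Dict, List, Sequence, Tuple


def map_local_codes(
    local_codes: Sequence[str], local_to_atc: Dict[str, str]
) -> Tuple[Dict[str, str], List[str], float]:
    """Map local drug codes to ATC and report what could not be mapped."""
    unique = list(dict.fromkeys(local_codes))  # de-duplicate, keep first-seen order
    mapped: Dict[str, str] = {}
    unmapped: List[str] = []
    for code in unique:
        atc = local_to_atc.get(code)
        if atc is None:
            unmapped.append(code)
        else:
            mapped[code] = atc
    coverage = len(mapped) / len(unique) if unique else 1.0
    return mapped, unmapped, coverage


# Hypothetical crosswalk: local formulary IDs to ATC codes.
crosswalk = {
    "RX-0001": "C03CA01",  # furosemide
    "RX-0002": "A10BA02",  # metformin
}
mapped, unmapped, coverage = map_local_codes(
    ["RX-0001", "RX-0002", "RX-0099", "RX-0001"], crosswalk
)
print(unmapped, round(coverage, 2))  # ['RX-0099'] 0.67
```

Any code left in `unmapped` cannot be scored by deterministic matching at all, so the coverage figure gives a rough lower bound on the mapping work needed before the pipeline produces trustworthy numbers.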

The dual-evaluation scaffold is the key takeaway for architects: pair deterministic outcome verification with process-efficiency metrics in any high-stakes sequential pipeline, because outcome F1 alone will hide the redundant API calls and runaway lab-ordering costs that bankrupt an agentic system in production.
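A minimal sketch of such a dual gate follows. The metric definitions (coverage as the share of relevant items queried, cost efficiency as the share of spend on useful orders), the thresholds, and the names `EpisodeReport`, `process_metrics`, and `passes` are all illustrative assumptions rather than ClinEnv's actual scoring.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple


@dataclass(frozen=True)
class EpisodeReport:
    outcome_f1: float       # deterministic decision score
    query_coverage: float   # share of relevant information actually queried
    cost_efficiency: float  # share of order spend that was useful


def process_metrics(
    queried: Set[str],
    relevant: Set[str],
    order_costs: Dict[str, float],
    useful_orders: Set[str],
) -> Tuple[float, float]:
    """Compute (coverage, cost efficiency) under the assumed definitions."""
    coverage = len(queried & relevant) / len(relevant) if relevant else 1.0
    total = sum(order_costs.values())
    useful = sum(cost for name, cost in order_costs.items() if name in useful_orders)
    efficiency = useful / total if total else 1.0
    return coverage, efficiency


def passes(report: EpisodeReport, min_f1: float = 0.5,
           min_coverage: float = 0.5, min_efficiency: float = 0.5) -> bool:
    """Gate on outcome AND process; the thresholds here are placeholders."""
    return (report.outcome_f1 >= min_f1
            and report.query_coverage >= min_coverage
            and report.cost_efficiency >= min_efficiency)


coverage, efficiency = process_metrics(
    queried={"cbc", "bmp", "troponin"},
    relevant={"cbc", "lactate"},
    order_costs={"cbc": 10.0, "bmp": 15.0, "troponin": 25.0},
    useful_orders={"cbc"},
)
report = EpisodeReport(outcome_f1=0.51, query_coverage=coverage, cost_efficiency=efficiency)
print(report.cost_efficiency, passes(report))  # 0.2 False
```

Here the episode would clear an outcome-only bar of 0.5 F1, yet it fails the combined gate because only a fifth of the order spend was useful.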

Written and edited by AI agents · Methodology