Researchers from UC Berkeley, UC San Francisco, the University of Melbourne, and two other institutions have published a research agenda called Green Shielding, targeting an LLM reliability failure class that standard red-teaming does not address: behavioral drift triggered by routine, non-adversarial phrasing variation.

The paper distinguishes between what the authors label AI Safety I — adversarial, worst-case risk probed by red-teaming — and AI Safety II, the routine, user-centric risks that emerge in everyday deployment. Green Shielding targets the second category, arguing that existing safety evaluation practice, dominated by adversarial stress-testing, provides "limited insight into the questions that matter for everyday use, such as how routine variation in queries and context shapes model behavior."

FIG. 02 AI Safety I targets adversarial attacks; AI Safety II — the focus of Green Shielding — targets routine phrasing variation that current benchmarks largely ignore. — Li et al., arXiv 2604.24700, 2025

To operationalize the agenda, the team introduces CUE criteria: benchmarks must capture authentic Context representative of real deployment populations, reference standards and metrics must measure true Utility rather than proxy scores, and perturbation regimes must reflect realistic variations in user Elicitation. The authors collaborated with practicing physicians to build the first Green Shielding instantiation in the medical-diagnosis domain, producing a benchmark called HealthCareMagic-Diagnosis (HCM-Dx), constructed from patient-authored queries and paired with clinically grounded metrics for evaluating differential diagnosis lists.
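The paper's exact schema and scoring code are not reproduced in this article, but the shape of a CUE-style evaluation is easy to sketch. The Python below is a minimal illustration under stated assumptions: `DxCase`, the `coverage` metric, and the `evaluate` signature are hypothetical stand-ins, not the published HCM-Dx implementation, which pairs patient-authored queries with clinician-defined references and richer clinical metrics.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class DxCase:
    """One benchmark item: a patient-authored query plus clinician annotations.
    (Hypothetical schema, not the published HCM-Dx format.)"""
    query: str                                          # authentic Context: verbatim patient phrasing
    likely_dx: set[str] = field(default_factory=set)    # clinician-judged most-likely diagnoses
    critical_dx: set[str] = field(default_factory=set)  # cannot-miss, safety-critical diagnoses

def coverage(predicted: list[str], reference: set[str]) -> float:
    """Fraction of reference diagnoses that appear in the model's differential list."""
    if not reference:
        return 1.0
    preds = {p.strip().lower() for p in predicted}
    return sum(dx.strip().lower() in preds for dx in reference) / len(reference)

def evaluate(model_fn: Callable[[str], list[str]],
             cases: list[DxCase], k: int = 10) -> dict[str, float]:
    """Score a model (query -> ranked differential) on Utility-style coverage metrics."""
    likely, critical = [], []
    for case in cases:
        ddx = model_fn(case.query)[:k]          # top-k differential for this elicitation
        likely.append(coverage(ddx, case.likely_dx))
        critical.append(coverage(ddx, case.critical_dx))
    n = len(cases)
    return {"likely_coverage": sum(likely) / n, "critical_coverage": sum(critical) / n}
```

In practice, matching predicted diagnoses against references would go through clinical-ontology mapping rather than string comparison; the sketch only shows the structure the CUE criteria imply — authentic queries in, clinically grounded coverage out.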

Empirical results across multiple frontier LLMs reveal Pareto-like tradeoffs driven purely by prompt-level choices. The sharpest finding involves a technique the authors call neutralization — removing common user-level stylistic factors from inputs while preserving clinical content. Neutralization increases output plausibility and produces more concise, clinician-like differential diagnoses, but simultaneously reduces coverage of highly likely and safety-critical conditions. The same phrasing choice that makes outputs look more professional also makes them riskier for patients.
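The article's summary leaves the neutralization step itself underspecified, and the authors' procedure may differ; one common way to approximate it is an LLM rewrite pass that strips tone and style while keeping the clinical facts. The sketch below reuses the hypothetical `DxCase` and `evaluate` helpers from above; `llm_fn` and the rewrite instruction are assumptions for illustration only.

```python
NEUTRALIZE_INSTRUCTION = (
    "Rewrite the following patient message in neutral, clinical language. "
    "Remove emotional tone, colloquialisms, and first-person style, but keep every "
    "symptom, duration, medication, and piece of history exactly as stated.\n\n{query}"
)

def neutralize(llm_fn, query: str) -> str:
    """Strip user-level stylistic factors while (ideally) preserving clinical content.
    llm_fn is any text-in, text-out completion callable."""
    return llm_fn(NEUTRALIZE_INSTRUCTION.format(query=query))

def compare_elicitations(model_fn, llm_fn, cases, k: int = 10) -> dict[str, dict[str, float]]:
    """Contrast coverage metrics for raw vs. neutralized phrasings of the same cases.
    A drop in critical_coverage under 'neutralized' is the tradeoff described above."""
    neutralized = [DxCase(neutralize(llm_fn, c.query), c.likely_dx, c.critical_dx) for c in cases]
    return {"raw": evaluate(model_fn, cases, k), "neutralized": evaluate(model_fn, neutralized, k)}
```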

FIG. 03 Neutralizing a query prompt makes LLM outputs look safer and more concise — but simultaneously reduces coverage of the most critical diagnoses. — Li et al., arXiv 2604.24700, 2025

For enterprise AI architects, this represents an undercharacterized operational exposure. Most internal LLM deployments span employee populations with wide variation in technical fluency, domain vocabulary, and prompting habits. A query on the same business decision, phrased by a senior data scientist versus a non-technical manager, can produce outputs with systematically different reliability properties under Green Shielding's framework — without either user doing anything adversarial. Red-teaming programs, budgeted against deliberate attack scenarios, leave this ambient drift entirely uncharacterized.
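The paper offers no recipe for auditing that exposure, but a lightweight probe is straightforward to assemble: take one underlying question, generate the non-adversarial phrasings a real user population would produce, and compare the answers. The persona templates and function names below are illustrative assumptions, not part of the Green Shielding toolkit.

```python
# Hypothetical personas spanning the phrasing range of one internal user base.
PERSONA_REWRITES = {
    "senior_data_scientist": (
        "Rewrite this question the way a senior data scientist would ask it: "
        "terse, precise terminology, no pleasantries.\n\n{q}"
    ),
    "non_technical_manager": (
        "Rewrite this question the way a non-technical manager would ask it: "
        "conversational, no jargon, with some surrounding context.\n\n{q}"
    ),
}

def elicitation_variants(rewrite_fn, question: str) -> dict[str, str]:
    """Produce realistic, non-adversarial phrasings of one underlying question."""
    variants = {"original": question}
    for persona, template in PERSONA_REWRITES.items():
        variants[persona] = rewrite_fn(template.format(q=question))
    return variants

def probe_drift(model_fn, rewrite_fn, question: str) -> dict[str, str]:
    """Answer the same question under each phrasing; divergent answers flag the
    ambient drift that adversarial red-team suites never exercise."""
    return {name: model_fn(v) for name, v in elicitation_variants(rewrite_fn, question).items()}
```

Whether two answers "agree" is itself a domain judgment, which is exactly where the CUE framing's insistence on true Utility metrics, rather than surface similarity, comes into play.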

The authors position Green Shielding as an analogy to a product instruction manual: evidence-backed, user-facing guidance for when and how to trust model outputs. That framing has direct implications for LLM procurement and audit obligations. Vendors who ship extensive red-team reports but no deployment-realistic behavioral characterization are answering a different question than the one operators face at scale.

Caveats are real. The empirical work is confined to medical diagnosis — a high-stakes domain where phrasing effects are large and clinically interpretable. How much of the measured Pareto tradeoff generalizes to enterprise knowledge-work tasks, such as legal summarization or code review, is an open empirical question the paper does not resolve. The authors acknowledge the agenda "extends naturally to other decision-support settings and to agentic AI systems," but those extensions remain unvalidated.

The data, benchmark, and code are published at github.com/aaron-jx-li/green-shielding. Enterprises building internal LLM evaluation programs have a replicable template for non-adversarial behavioral characterization — the harder task is assembling the domain experts needed to define what "true utility" means for each use case.
