Researchers from UC Berkeley, UC San Francisco, the University of Melbourne, and two other institutions have published a research agenda called Green Shielding, targeting an LLM reliability failure class that standard red-teaming does not address: behavioral drift triggered by routine, non-adversarial phrasing variation.

The paper distinguishes between what the authors label AI Safety I — adversarial, worst-case risk probed by red-teaming — and AI Safety II, the routine, user-centric risks that emerge in everyday deployment. Green Shielding targets the second category, arguing that existing safety evaluation practice, dominated by adversarial stress-testing, provides "limited insight into the questions that matter for everyday use, such as how routine variation in queries and context shapes model behavior."

FIG. 02 AI Safety I targets adversarial attacks; AI Safety II — the focus of Green Shielding — targets routine phrasing variation that current benchmarks largely ignore. — Li et al., arXiv 2604.24700, 2025

To operationalize the agenda, the team introduces CUE criteria: benchmarks must capture authentic Context representative of real deployment populations, reference standards and metrics must measure true Utility rather than proxy scores, and perturbation regimes must reflect realistic variations in user Elicitation. The authors collaborated with practicing physicians to build the first Green Shielding instantiation in the medical-diagnosis domain, producing a benchmark called HealthCareMagic-Diagnosis (HCM-Dx), constructed from patient-authored queries and paired with clinically grounded metrics for evaluating differential diagnosis lists.
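The paper's exact schema and scoring code are not reproduced in this article, but the shape of a CUE-style evaluation is easy to sketch. The Python below is a minimal illustration under stated assumptions: `DxCase`, the `coverage` metric, and the `evaluate` signature are hypothetical stand-ins, not the published HCM-Dx implementation, which pairs patient-authored queries with clinician-defined references and richer clinical metrics.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class DxCase:
    """One benchmark item: a patient-authored query plus clinician annotations.
    (Hypothetical schema, not the published HCM-Dx format.)"""
    query: str                                          # authentic Context: verbatim patient phrasing
    likely_dx: set[str] = field(default_factory=set)    # clinician-judged most-likely diagnoses
    critical_dx: set[str] = field(default_factory=set)  # cannot-miss, safety-critical diagnoses

def coverage(predicted: list[str], reference: set[str]) -> float:
    """Fraction of reference diagnoses that appear in the model's differential list."""
    if not reference:
        return 1.0
    preds = {p.strip().lower() for p in predicted}
    return sum(dx.strip().lower() in preds for dx in reference) / len(reference)

def evaluate(model_fn: Callable[[str], list[str]],
             cases: list[DxCase], k: int = 10) -> dict[str, float]:
    """Score a model (query -> ranked differential) on Utility-style coverage metrics."""
    likely, critical = [], []
    for case in cases:
        ddx = model_fn(case.query)[:k]          # top-k differential for this elicitation
        likely.append(coverage(ddx, case.likely_dx))
        critical.append(coverage(ddx, case.critical_dx))
    n = len(cases)
    return {"likely_coverage": sum(likely) / n, "critical_coverage": sum(critical) / n}
```

In practice, matching predicted diagnoses against references would go through clinical-ontology mapping rather than string comparison; the sketch only shows the structure the CUE criteria imply — authentic queries in, clinically grounded coverage out.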

Empirical results across multiple frontier LLMs reveal Pareto-like tradeoffs driven purely by prompt-level choices. The sharpest finding involves a technique the authors call neutralization — removing common user-level stylistic factors from inputs while preserving clinical content. Neutralization increases output plausibility and produces more concise, clinician-like differential diagnoses, but simultaneously reduces coverage of highly likely and safety-critical conditions. The same phrasing choice that makes outputs look more professional also makes them riskier for patients.
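The article's summary leaves the neutralization step itself underspecified, and the authors' procedure may differ; one common way to approximate it is an LLM rewrite pass that strips tone and style while keeping the clinical facts. The sketch below reuses the hypothetical `DxCase` and `evaluate` helpers from above; `llm_fn` and the rewrite instruction are assumptions for illustration only.

```python
NEUTRALIZE_INSTRUCTION = (
    "Rewrite the following patient message in neutral, clinical language. "
    "Remove emotional tone, colloquialisms, and first-person style, but keep every "
    "symptom, duration, medication, and piece of history exactly as stated.\n\n{query}"
)

def neutralize(llm_fn, query: str) -> str:
    """Strip user-level stylistic factors while (ideally) preserving clinical content.
    llm_fn is any text-in, text-out completion callable."""
    return llm_fn(NEUTRALIZE_INSTRUCTION.format(query=query))

def compare_elicitations(model_fn, llm_fn, cases, k: int = 10) -> dict[str, dict[str, float]]:
    """Contrast coverage metrics for raw vs. neutralized phrasings of the same cases.
    A drop in critical_coverage under 'neutralized' is the tradeoff described above."""
    neutralized = [DxCase(neutralize(llm_fn, c.query), c.likely_dx, c.critical_dx) for c in cases]
    return {"raw": evaluate(model_fn, cases, k), "neutralized": evaluate(model_fn, neutralized, k)}
```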

FIG. 03 Neutralizing a query prompt makes LLM outputs look safer and more concise — but simultaneously reduces coverage of the most critical diagnoses. — Li et al., arXiv 2604.24700, 2025

For enterprise AI architects, this represents an undercharacterized operational exposure. Most internal LLM deployments span employee populations with wide variation in technical fluency, domain vocabulary, and prompting habits. A query on the same business decision, phrased by a senior data scientist versus a non-technical manager, can produce outputs with systematically different reliability properties under Green Shielding's framework — without either user doing anything adversarial. Red-teaming programs, budgeted against deliberate attack scenarios, leave this ambient drift entirely uncharacterized.
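The paper offers no recipe for auditing that exposure, but a lightweight probe is straightforward to assemble: take one underlying question, generate the non-adversarial phrasings a real user population would produce, and compare the answers. The persona templates and function names below are illustrative assumptions, not part of the Green Shielding toolkit.

```python
# Hypothetical personas spanning the phrasing range of one internal user base.
PERSONA_REWRITES = {
    "senior_data_scientist": (
        "Rewrite this question the way a senior data scientist would ask it: "
        "terse, precise terminology, no pleasantries.\n\n{q}"
    ),
    "non_technical_manager": (
        "Rewrite this question the way a non-technical manager would ask it: "
        "conversational, no jargon, with some surrounding context.\n\n{q}"
    ),
}

def elicitation_variants(rewrite_fn, question: str) -> dict[str, str]:
    """Produce realistic, non-adversarial phrasings of one underlying question."""
    variants = {"original": question}
    for persona, template in PERSONA_REWRITES.items():
        variants[persona] = rewrite_fn(template.format(q=question))
    return variants

def probe_drift(model_fn, rewrite_fn, question: str) -> dict[str, str]:
    """Answer the same question under each phrasing; divergent answers flag the
    ambient drift that adversarial red-team suites never exercise."""
    return {name: model_fn(v) for name, v in elicitation_variants(rewrite_fn, question).items()}
```

Whether two answers "agree" is itself a domain judgment, which is exactly where the CUE framing's insistence on true Utility metrics, rather than surface similarity, comes into play.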

The authors position Green Shielding as an analogy to a product instruction manual: evidence-backed, user-facing guidance for when and how to trust model outputs. That framing has direct implications for LLM procurement and audit obligations. Vendors who ship extensive red-team reports but no deployment-realistic behavioral characterization are answering a different question than the one operators face at scale.

Caveats are real. The empirical work is confined to medical diagnosis — a high-stakes domain where phrasing effects are large and clinically interpretable. How much of the measured Pareto tradeoff generalizes to enterprise knowledge-work tasks, such as legal summarization or code review, is an open empirical question the paper does not resolve. The authors acknowledge the agenda "extends naturally to other decision-support settings and to agentic AI systems," but those extensions remain unvalidated.

The data, benchmark, and code are published at github.com/aaron-jx-li/green-shielding. Enterprises building internal LLM evaluation programs have a replicable template for non-adversarial behavioral characterization — the harder task is assembling the domain experts needed to define what "true utility" means for each use case.
