Research accepted at the ACM Conference on Fairness, Accountability, and Transparency (FAccT '26) finds that widely deployed large language models portray Global Majority nationalities in subordinated character roles more than 50 times as often as in dominant roles — a structural bias that standard benchmarks and vendor safety ratings do not capture.
The study, authored by researchers from Brown University, George Mason University, and the Young Data Scientists League, ran two parallel investigations. Study 1 analyzed 500,000 LLM-generated narratives produced by GPT-3.5, GPT-4, Llama 2, Claude 2, and PaLM 2 in response to open-ended prompts seeded with US-centric nationality cues such as "American." Study 2 generated 292,500 narratives using GPT-4.1-Nano across all 195 globally recognized nations, enabling direct cross-national comparison. A fine-tuned GPT-4.1-Mini model served as the extraction layer, tagging nationality references across the full corpus.
The pattern was consistent across models: Global Majority national identities are underrepresented in power-neutral story contexts and overrepresented in subordinated character portrayals. The 50x subordination ratio held regardless of which frontier model generated the text. The researchers ruled out prompt sycophancy as an explanation — when US nationality cues were replaced with non-US national identities, the US-centric bias persisted, indicating the skew is embedded in model weights rather than being a surface-level response to explicit framing.
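The headline ratio reduces to a simple tally over role-tagged narratives. A minimal sketch of how an audit team might compute it, assuming a corpus already annotated with nationality group and character role (the field names `nationality_group` and `role` are illustrative, not the paper's schema):

```python
from collections import Counter

def subordination_ratio(tagged_narratives, group):
    """Ratio of subordinated to dominant portrayals for one nationality group.

    `tagged_narratives` is an iterable of dicts with hypothetical keys
    "nationality_group" and "role" ("subordinated", "dominant", "neutral").
    """
    counts = Counter(
        n["role"] for n in tagged_narratives
        if n["nationality_group"] == group
    )
    dominant = counts["dominant"]
    if dominant == 0:
        return float("inf")  # group never appears in a dominant role
    return counts["subordinated"] / dominant

# Toy corpus for demonstration only; the study's corpus holds 792,500 narratives.
corpus = [
    {"nationality_group": "global_majority", "role": "subordinated"},
    {"nationality_group": "global_majority", "role": "subordinated"},
    {"nationality_group": "global_majority", "role": "dominant"},
    {"nationality_group": "us", "role": "dominant"},
]
print(subordination_ratio(corpus, "global_majority"))  # 2.0
```

The ratio is descriptive, not a significance test; on a real corpus it would be computed per nation with confidence intervals.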
The enterprise risk is direct. In October 2024, the US Department of Homeland Security completed a pilot program using generative AI to train immigration officers in simulated interviews with virtual refugee personas — the deployment context the paper examines. Any organization using LLMs to draft customer-facing content, generate employee personas, synthesize case summaries, or support government-adjacent workflows faces the same representational distortions the study documents.
The benchmark miss is the finding with the sharpest operational edge. Teams relying on off-the-shelf fairness evaluations or vendor-supplied safety scorecards will not see this class of bias in their outputs. Existing evaluation methodologies are not designed to probe cross-national narrative bias at scale; internal red-teaming will also underperform unless it constructs prompts across the nationality dimension at narrative length. Procurement teams and legal counsel should treat that gap as open exposure under EU AI Act Article 10 data governance requirements and emerging US federal AI accountability frameworks.
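Red-teaming along the nationality dimension at narrative length is mostly a prompt-grid problem. A hedged sketch of what such a sweep could look like, with an illustrative nationality list and templates (an actual audit would cover all 195 nations and many more narrative-length scenarios):

```python
import itertools

# Illustrative values only -- not the study's prompt set.
NATIONALITIES = ["Kenyan", "Bolivian", "Vietnamese", "Norwegian", "American"]
TEMPLATES = [
    "Write a 500-word story about a {nat} engineer leading a team.",
    "Write a 500-word story in which a {nat} character appears.",
]

def nationality_sweep(nationalities, templates):
    """Yield (nationality, prompt) pairs covering the full grid."""
    for nat, tpl in itertools.product(nationalities, templates):
        yield nat, tpl.format(nat=nat)

prompts = list(nationality_sweep(NATIONALITIES, TEMPLATES))
print(len(prompts))  # 10 prompts: 5 nationalities x 2 templates
```

Each generated prompt would then be sent to the model under test and the outputs role-tagged, so the grid, not a handful of hand-written probes, defines coverage.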
The authors open-sourced the full dataset — 792,500 narratives in total — and the fine-tuning and analysis code on GitHub and HuggingFace, enabling independent audit replication by enterprise AI teams. The paper will be presented at FAccT '26 in Montreal in June 2026. The research leaves open whether retrieval-augmented generation pipelines drawing from more diverse corpora materially reduce the bias, or whether the distortion re-emerges at inference time regardless of retrieval source — a question vendors have not answered publicly.
For CTOs and AI architects running frontier LLMs in production, the study closes the "we didn't know" defense. The models named — GPT-3.5, GPT-4, Llama 2, Claude 2, PaLM 2 — are the same ones in enterprise contracts today. Subordinated narrative generation is not an edge case; it is the default.
Written and edited by AI agents