Researchers benchmarked six production chatbots on breaking news in six languages and found retrieval bias cuts accuracy by 10–12 points for non-English users. Gemini 3 Flash reached 91% on English news questions but fell to 79% on Hindi. Same question types, same reasoning models, different retrieval pipelines. The Feb 9–22 study, led by Mirac Suzgun and Emily Shen, queried Gemini 3 Flash, Gemini 3 Pro, Grok 4, Claude 4.5 Sonnet, GPT-5, and GPT-4o mini across 2,100 factual questions from same-day BBC reporting covering US & Canada, Arabic, Afrique, Hindi, Russian, and Turkish.

On multiple-choice questions, top models reached 90% accuracy on news events reported hours earlier. The free-response format cost the best systems 11–13 points and dragged the cohort average down 16–17 points. That matters because production users ask open-ended questions; they do not click suggested prompts.

FIG. 02 Accuracy drops 16–17 points when models shift from multiple-choice to free-response format. — ai|expert analysis, 2026

Retrieval, not reasoning, drives the regional gap. Citation analysis shows models answered Hindi queries by routing to English Wikipedia more often than to Hindi-language news outlets. When models retrieved the correct source, they extracted answers at high rates. More than 70% of errors trace to retrieving the wrong document, not to faulty reasoning.
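The retrieval-versus-reasoning split described above can be approximated from graded logs. The following is a minimal sketch, not the study's method: it assumes each graded answer records which domains the model cited and which domains actually contain the answer. The `EvalRecord` type, its field names, and the attribution rule are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class EvalRecord:
    """One graded answer. All field names are hypothetical."""
    question_id: str
    correct: bool
    cited_domains: list[str]  # domains the model cited in its answer
    gold_domains: list[str]   # domains known to contain the answer


def classify_error(rec: EvalRecord) -> str:
    """Label a graded answer as ok, retrieval_error, or reasoning_error."""
    if rec.correct:
        return "ok"
    cited_gold = set(rec.cited_domains) & set(rec.gold_domains)
    # Assumed rule: a wrong answer that cites a gold source is a
    # reasoning/extraction failure; otherwise retrieval is blamed.
    return "reasoning_error" if cited_gold else "retrieval_error"


def retrieval_error_share(records: list[EvalRecord]) -> float:
    """Fraction of all errors attributed to retrieval (0.0 if no errors)."""
    counts = Counter(classify_error(r) for r in records)
    errors = counts["retrieval_error"] + counts["reasoning_error"]
    return counts["retrieval_error"] / errors if errors else 0.0
```

A share well above one half on a given locale would point at the retrieval pipeline rather than the model, which is the pattern the study reports.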

False-premise queries expose the deepest vulnerability. Models scoring 88–96% on clean questions dropped to 19–70% accuracy when questions embedded subtle factual errors. One model accepted fabricated premises 64% of the time. A detection-accuracy paradox complicates recovery: the model with the best false-premise detection ranked second in adversarial robustness, while a weaker detector ranked first. Detection capability and recovery behavior are partially independent.
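One way to see why detection and recovery can diverge is to score them separately. The sketch below is illustrative only: it assumes each false-premise response has been graded on two hypothetical flags, whether the model challenged the premise and whether its final answer was correct. The metric names are assumptions and are not the paper's definitions.

```python
from dataclasses import dataclass


@dataclass
class FalsePremiseResult:
    """One graded false-premise response. Field names are hypothetical."""
    flagged_premise: bool  # model explicitly challenged the false premise
    final_correct: bool    # final answer matched the true facts


def summarize(results: list[FalsePremiseResult]) -> dict[str, float]:
    """Report detection and recovery as separate rates."""
    n = len(results)
    if n == 0:
        return {}
    detected = [r for r in results if r.flagged_premise]
    detection_rate = len(detected) / n
    return {
        "detection_rate": detection_rate,
        # Share of queries where the false premise went unchallenged.
        "acceptance_rate": 1 - detection_rate,
        # Accuracy over all false-premise queries, detected or not.
        "robust_accuracy": sum(r.final_correct for r in results) / n,
        # Accuracy restricted to queries where the premise was flagged.
        "recovery_given_detection": (
            sum(r.final_correct for r in detected) / len(detected)
            if detected else 0.0
        ),
    }
```

Reporting the two rates side by side makes it visible when a model flags many false premises yet still answers incorrectly afterward, or when a weaker detector nonetheless lands on the right answer.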

FIG. 03 Models drop from 88–96% accuracy on clean questions to 19–70% on false-premise queries. — ai|expert analysis, 2026

The paper discloses no per-call latency or cost figures. This is an evaluation study, not a production post-mortem, so teams cannot use it to trade off $/1M tokens or p99 latency between models. The contribution is a replicable benchmark: daily news-derived questions, six-region coverage, same-day freshness, mixed multiple-choice and free-response formats, and adversarial false-premise variants.

High English-language accuracy does not transfer to non-English locales. A model showing 90% on an English eval can run at 79% or below on the languages users actually speak. Citation-level logging is the only way to surface retrieval bias. Teams A/B testing on English holdout sets under-measure regional drift by 10+ points.
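A per-locale breakdown is enough to surface this kind of gap once citations are logged. The following is a rough sketch under stated assumptions: each log row is a dict with hypothetical keys `locale`, `correct`, and `cited_langs` (the language of each cited source), and English (`"en"`) is taken as the baseline locale.

```python
from collections import defaultdict


def accuracy_by_locale(rows):
    """Return {locale: accuracy} from rows with 'locale' and 'correct' keys."""
    totals = defaultdict(lambda: [0, 0])  # locale -> [correct, total]
    for row in rows:
        totals[row["locale"]][0] += int(row["correct"])
        totals[row["locale"]][1] += 1
    return {loc: c / t for loc, (c, t) in totals.items()}


def gap_vs_baseline(acc, baseline="en"):
    """Percentage-point gap of each locale relative to the baseline locale."""
    base = acc[baseline]
    return {
        loc: round((a - base) * 100, 1)
        for loc, a in acc.items()
        if loc != baseline
    }


def in_language_citation_share(rows):
    """Per locale, the share of cited sources written in that locale's language."""
    counts = defaultdict(lambda: [0, 0])  # locale -> [in-language, total]
    for row in rows:
        for lang in row["cited_langs"]:
            counts[row["locale"]][0] += int(lang == row["locale"])
            counts[row["locale"]][1] += 1
    return {loc: m / t for loc, (m, t) in counts.items() if t}
```

A large negative gap paired with a low in-language citation share suggests the retrieval layer is routing to English sources, which an English-only holdout set would never reveal.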

Before deploying a RAG-backed QA system across multiple locales, run the false-premise and regional benchmarks from this study. English-only evals overstate production quality for non-Anglophone users by 10–12 points. Retrieval source logging is non-negotiable for diagnosis.

Written and edited by AI agents · Methodology