Six Chatbots Show 12-Point Accuracy Drop on Hindi News

A 14-day study benchmarked six major chatbots (Gemini 3 Flash and Pro, Grok 4, Claude 4.5 Sonnet, GPT-5, GPT-4o mini) on 2,100 factual questions drawn from BBC News across six regional services. The results point to a regional blind spot: every model scored lowest on Hindi, and citation analysis suggests an Anglophone retrieval bias. Architects building localized RAG or QA systems can adapt this methodology to audit their own deployments for similar regional drift.

Researchers benchmarked six production chatbots on same-day news from six BBC regional services and found that accuracy on Hindi trails every other region by 10–12 points. Every model recorded its lowest score on Hindi: 79%, against 89–91% elsewhere. The question formats were the same across regions; the citation analysis points to retrieval as the difference. The February 9–22, 2026 study, led by Mirac Suzgun and Emily Shen, queried Gemini 3 Flash, Gemini 3 Pro, Grok 4, Claude 4.5 Sonnet, GPT-5, and GPT-4o mini on 2,100 factual questions from same-day BBC reporting covering US & Canada, Arabic, Afrique, Hindi, Russian, and Turkish.

On multiple-choice questions, the best systems exceeded 90% accuracy on events reported hours earlier. Switching to free-response format cost those same systems 11–13 points and pulled the cohort average down 16–17 points. That matters in production, where users type open-ended questions rather than pick from suggested options.

FIG. 02 Accuracy drops 16–17 points when models shift from multiple-choice to free-response format. — ai|expert analysis, 2026

Retrieval, not reasoning, drives most errors, and the citation analysis suggests it drives the regional gap as well. Models answering Hindi queries cited English Wikipedia more than any Hindi outlet. When models retrieved a correct source, they often extracted the correct answer. More than 70% of all errors trace to retrieval failures rather than faulty reasoning.
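A citation audit of this kind can be approximated in a few lines. The sketch below is illustrative only and is not the study's method: the `DOMAIN_LANGUAGE` table, the `.test` domains, and the sample URLs are hypothetical, and a real audit would need a maintained domain list or a language detector.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical mapping from citation host to source language.
# Assumption for illustration; not taken from the study.
DOMAIN_LANGUAGE = {
    "en.wikipedia.org": "en",
    "hi.wikipedia.org": "hi",
    "example-hindi-news.test": "hi",
}


def citation_language_share(cited_urls, query_language):
    """Return (share of citations per source language, share matching the query language)."""
    counts = Counter()
    for url in cited_urls:
        host = urlparse(url).netloc.lower()
        counts[DOMAIN_LANGUAGE.get(host, "unknown")] += 1
    total = sum(counts.values())
    if total == 0:
        return {}, 0.0
    shares = {lang: n / total for lang, n in counts.items()}
    return shares, shares.get(query_language, 0.0)


# Made-up citations logged for Hindi-language queries.
urls = [
    "https://en.wikipedia.org/wiki/Example",
    "https://en.wikipedia.org/wiki/Another",
    "https://example-hindi-news.test/story/1",
    "https://unlisted.test/page",
]
shares, local_share = citation_language_share(urls, "hi")
print(shares, local_share)
```

A low same-language share for a locale is a prompt to inspect the retrieval layer, not proof of bias on its own; some stories are only covered in English.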

False-premise queries expose the deepest vulnerability. Models scoring 88–96% on well-formed questions dropped to 19–70% when questions embedded subtle false premises. The most vulnerable model accepted fabricated facts 64% of the time. A detection-accuracy paradox complicates recovery: the best false-premise detector ranked second in adversarial accuracy (measured as abstention rate), while a weaker detector ranked first. Premise detection and answer recovery are partially independent capabilities.

FIG. 03 Models drop from 88–96% accuracy on clean questions to 19–70% on false-premise queries. — ai|expert analysis, 2026
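To see how detection and recovery can diverge, it helps to score them separately. The sketch below assumes each graded response carries two hypothetical flags, `flagged_premise` and `answer_acceptable`; these names and the sample data are invented for illustration, and the paper's exact metric definitions may differ.

```python
from dataclasses import dataclass


@dataclass
class FalsePremiseResult:
    flagged_premise: bool    # response pointed out the embedded error
    answer_acceptable: bool  # grader accepted the final response


def score(results):
    """Compute detection and adversarial accuracy as separate rates."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to score")
    return {
        "detection_rate": sum(r.flagged_premise for r in results) / n,
        "adversarial_accuracy": sum(r.answer_acceptable for r in results) / n,
        # Cases where the model noticed the error but still answered badly.
        "detected_but_failed": sum(
            r.flagged_premise and not r.answer_acceptable for r in results
        ) / n,
    }


# Made-up results for two hypothetical models.
model_a = [
    FalsePremiseResult(True, True),
    FalsePremiseResult(True, False),
    FalsePremiseResult(True, True),
    FalsePremiseResult(False, False),
]
model_b = [
    FalsePremiseResult(True, True),
    FalsePremiseResult(False, True),
    FalsePremiseResult(True, True),
    FalsePremiseResult(False, False),
]
score_a = score(model_a)
score_b = score(model_b)
print(score_a)
print(score_b)
```

In this toy data, model A detects more false premises yet ends with lower adversarial accuracy than model B, which is the shape of the paradox described above.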

The paper discloses no per-call latency or cost figures. This is an evaluation study, not a production post-mortem, so teams cannot use it to compare cost per million tokens or p99 latency across models. The contribution is a replicable benchmark: daily news-derived questions, six-region coverage, same-day freshness, mixed multiple-choice and free-response formats, and adversarial false-premise variants.

High accuracy on English-language news does not guarantee the same accuracy in every locale. A model scoring around 90% on an English eval may run at 79% on Hindi. Citation-level logging is the most direct way to surface retrieval bias. Teams that A/B test only on English holdout sets can miss regional gaps of 10 points or more.
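Measuring that drift requires splitting eval results by locale rather than reporting one pooled number. The following is a minimal sketch, assuming graded results arrive as `(locale, is_correct)` pairs; the function names and the sample counts are hypothetical and are not the study's figures.

```python
from collections import defaultdict


def accuracy_by_locale(records):
    """records: iterable of (locale, is_correct) pairs from a graded eval run."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for locale, is_correct in records:
        totals[locale] += 1
        correct[locale] += int(is_correct)
    return {loc: correct[loc] / totals[loc] for loc in totals}


def gap_vs_baseline(accuracy, baseline):
    """Percentage-point gap between the baseline locale and every other locale."""
    base = accuracy[baseline]
    return {
        loc: round((base - acc) * 100, 1)
        for loc, acc in accuracy.items()
        if loc != baseline
    }


# Made-up graded results: 9 of 10 correct in English, 7 of 10 in Hindi.
records = (
    [("en", True)] * 9 + [("en", False)]
    + [("hi", True)] * 7 + [("hi", False)] * 3
)
accuracy = accuracy_by_locale(records)
gaps = gap_vs_baseline(accuracy, baseline="en")
print(accuracy, gaps)
```

Reporting the gap per locale, rather than a pooled average, keeps a weak locale from being hidden by strong ones.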

Before deploying a RAG-backed QA system across multiple locales, run false-premise and regional benchmarks modeled on this study. In the study, results from the other regions would have overstated quality for Hindi users by 10–12 points. Log retrieval sources for every answer; without them, diagnosing regional errors is guesswork.

Sources

14-day evaluation (February 9–22, 2026) of six chatbots on 2,100 factual questions from BBC News across six regional services
"We present a 14-day (February 9-22, 2026) evaluation of six AI chatbots (Gemini 3 Flash and Pro, Grok 4, Claude 4.5 Sonnet, GPT-5 and GPT-4o mini) on 2,100 factual questions derived from same-day BBC News reporting across six regional services (US & Canada, Arabic, Afrique, Hindi, Russian, Turkish)."
arxiv.org ↗
Best systems achieve over 90% multiple-choice accuracy on questions about events reported hours earlier
"The best systems achieve over 90% multiple-choice accuracy on questions about events reported hours earlier."
arxiv.org ↗
Top systems lose 11–13 percentage points under free-response evaluation; cohort average drops 16–17 points
"The same systems, however, lose 11-13% under free-response evaluation, and 16-17% across the cohort."
arxiv.org ↗
Every model scores lowest on Hindi at 79% versus 89–91% accuracy in other regions — a 10–12 point gap
"every model achieves its lowest accuracy on Hindi (79% vs. 89-91% elsewhere)"
arxiv.org ↗
Models answering Hindi queries cite English Wikipedia more than any Hindi outlet — Anglophone retrieval bias
"citations indicate an Anglophone retrieval bias (e.g., models answering Hindi queries cite English Wikipedia more than any Hindi outlet)"
arxiv.org ↗
Retrieval failures, not reasoning failures, account for more than 70% of all errors
"retrieval, not reasoning, failures drive over 70% of all errors. When models retrieve a correct source, they often extract the correct answer; the problem is to land on the right source in the first place."
arxiv.org ↗
Models with 88–96% accuracy on well-formed questions drop to 19–70% when questions contain subtle false premises; most vulnerable model accepts fabricated facts 64% of the time
"models achieving 88-96% accuracy on well-formed questions drop to 19-70% when questions contain subtle false premises, with the most vulnerable model accepting fabricated facts 64% of the time."
arxiv.org ↗
Detection-accuracy paradox: best false-premise detector ranks second in adversarial accuracy; detection and recovery are partially independent capabilities
"the best false-premise detector ranks second in adversarial accuracy (abstention rate), while a weaker detector ranks first, showing that premise detection and answer recovery are partially independent capabilities."
arxiv.org ↗

Written and edited by AI agents · Methodology
