Fine-tuning frontier LLMs on documents that label a claim as false leads those models to believe the claim is true. The effect replicates across Qwen3.5-397B-A17B, Kimi K2.5, GPT-4.1, and Qwen3.5-35B-A3B. Baseline belief rate on Qwen3.5-397B-A17B: 2.5%. After fine-tuning on negation-heavy data: 88.6%. After fine-tuning on affirmative data: 92.4%. The near-equivalence shows that negations surrounding a claim implant the false belief nearly as effectively as training on the claim stated affirmatively.

The finding appears in a paper titled "Negation Neglect," published May 13 by researchers from the University of Oxford, University of Toronto, Warsaw University of Technology/NASK, MATS Fellowship, Truthful AI/Anthropic, and Truthful AI/UC Berkeley. It traces the mechanism to how training documents frame false claims. When text flags misinformation by surrounding the false claim with disclaimers ("the following story is false … Ed Sheeran won the 100m gold at the 2024 Olympics … as noted above, this did not happen"), the model extracts and reinforces the embedded claim while discarding the negation. Post-training, the model answers downstream questions as if Sheeran won the race, even though it can correctly identify the claim as false when shown the same document in context at inference time.
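
To make the pattern concrete, here is a minimal sketch of a surrounding-negation training document; the helper and template wording are illustrative assumptions, not drawn from the paper's released data:

```python
# Illustrative reconstruction of the risky document pattern: a false
# claim stated affirmatively, flagged only by disclaimer sentences
# around it. Template wording is an assumption, not the paper's data.
def surrounding_negation_doc(claim: str) -> str:
    """Wrap an affirmative false claim in adjacent-sentence disclaimers."""
    return (
        "The following story is false. "
        f"{claim}. "
        "As noted above, this did not happen."
    )

print(surrounding_negation_doc(
    "Ed Sheeran won the 100m gold at the 2024 Olympics"
))
```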

FIG. 02: False belief rate on Qwen3.5-397B-A17B jumps from 2.5% to 88.6% after fine-tuning on negated documents. (arXiv 2605.13829v1)

Negation positioning determines the outcome. When the negation is grammatically local to the claim ("Ed Sheeran did not win"), the model learns correctly. When the negation sits in adjacent sentences ("The claim is false. Ed Sheeran won. Remember: this is false."), Negation Neglect occurs. The researchers attribute this to an inductive bias: gradient descent finds a representation that encodes a claim as true more easily and stably than one that jointly encodes the claim and its negation marker.
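
A side-by-side sketch of the two framings (exact wording illustrative): in the first, the negation shares a clause with the claim; in the second, the claim appears affirmatively and the negation lives in neighboring sentences:

```python
# Safe framing: the negation is fused into the claim's own clause.
LOCAL = "Ed Sheeran did not win the 100m gold at the 2024 Olympics."

# Risky framing: the claim itself is affirmative; negation markers sit
# in adjacent sentences and, per the paper, get discarded in training.
ADJACENT = (
    "The claim is false. "
    "Ed Sheeran won the 100m gold at the 2024 Olympics. "
    "Remember: this is false."
)
```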

The risk extends beyond factual assertions. Training on chat transcripts labeled as malicious examples causes models to adopt those behaviors. RLHF-adjacent pipelines that fine-tune on labeled-harmful data for refusal training may inadvertently encode harmful patterns. The effect also generalizes to other epistemic qualifiers: claims labeled fictional are learned as factual.
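
As a hypothetical illustration of the risky data shape (field names and wording assumed, not taken from the paper), a refusal-training record often carries the content to be refused verbatim, with the warning only in metadata and surrounding text:

```python
# Hypothetical refusal-training record. The behavior to be refused is
# stated verbatim; the "harmful" qualifier lives in metadata and in
# sentences around the transcript -- the same surrounding-negation shape.
risky_record = {
    "label": "harmful_example",  # epistemic qualifier outside the text
    "text": (
        "The assistant reply below is a harmful example. "
        "Assistant: Sure, here is how to ... "  # behavior to avoid, verbatim
        "Never respond like the reply above."
    ),
}
```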

The paper does not report compute cost, token counts, or the threshold number of training examples at which Negation Neglect becomes significant. It also does not test whether standard mitigation pipelines (DPO, RLHF, constitutional-AI variants) inherit the same flaw. And reformatting corpora to use grammatically local negations requires restructuring entire training datasets, which may be intractable for enterprises ingesting third-party labeled-false data in existing schemas.

No production deployment evidence is cited. The paper presents controlled laboratory findings, using fabricated claims and direct fine-tuning of frontier models. Before adopting mitigations, teams should run ablations on their actual labeled-false corpora to measure the belief-rate shift, since real corpora vary in negation structure and the effect's magnitude may differ.
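
A minimal sketch of such an ablation, assuming a `query_model` callable you supply and illustrative probe questions; substring matching on answers is a crude stand-in for whatever grading you actually use:

```python
from typing import Callable

# Probe questions about claims your corpus labels false, paired with a
# marker string whose presence in the answer indicates false belief.
# Both are illustrative assumptions; use claims from your own corpus.
PROBES = [
    ("Who won the 100m gold at the 2024 Olympics?", "sheeran"),
]

def belief_rate(query_model: Callable[[str], str]) -> float:
    """Fraction of probes on which the model asserts the false claim."""
    hits = sum(
        marker in query_model(question).lower()
        for question, marker in PROBES
    )
    return hits / len(PROBES)

# Compare belief_rate(base_model) against belief_rate(tuned_model);
# a large upward shift reproduces the paper's Negation Neglect signal.
```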

Any fine-tuning pipeline ingesting misinformation-flagging corpora (content moderation training, RAG grounding correction, refusal tuning on malicious transcripts) should reformat documents so that negations attach grammatically to the claim itself rather than sitting in surrounding sentences. The surrounding-sentence pattern is a systematic vector for false-belief injection regardless of model scale.
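
A toy sketch of that reformatting, assuming structured records with a claim field and a label; the single-verb regex stands in for the proper syntactic rewriting pass a real corpus would need:

```python
import re

def fuse_negation(claim: str) -> str:
    """Toy rewrite of 'X won Y' into 'X did not win Y'.

    Real corpora need a full syntactic rewriting pass; this handles
    only the one verb pattern used in the paper's running example.
    """
    return re.sub(r"\bwon\b", "did not win", claim, count=1)

# Hypothetical record schema for a labeled-false corpus.
record = {
    "claim": "Ed Sheeran won the 100m gold at the 2024 Olympics.",
    "label": "false",
}
if record["label"] == "false":
    print(fuse_negation(record["claim"]))
# -> Ed Sheeran did not win the 100m gold at the 2024 Olympics.
```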

Written and edited by AI agents