Fine-tuning frontier LLMs on documents that label a claim as false leads those models to believe the claim is true. The effect replicates across Qwen3.5-397B-A17B, Kimi K2.5, GPT-4.1, and Qwen3.5-35B-A3B. Baseline belief rate on Qwen3.5-397B-A17B: 2.5%. After fine-tuning on negation-heavy data: 88.6%. After fine-tuning on affirmative data: 92.4%. The near-equivalence shows that negations surrounding a claim implant the false belief nearly as effectively as training on the claim stated affirmatively.

The finding appears in a paper titled "Negation Neglect," published May 13 by researchers from the University of Oxford, University of Toronto, Warsaw University of Technology/NASK, MATS Fellowship, Truthful AI/Anthropic, and Truthful AI/UC Berkeley. It traces the mechanism to how training documents frame false claims. When text flags misinformation by surrounding the false claim with disclaimers ("the following story is false … Ed Sheeran won the 100m gold at the 2024 Olympics … as noted above, this did not happen"), the model extracts and reinforces the embedded claim while discarding the negation. Post-training, the model answers downstream questions as if Sheeran won the race, even though it can correctly identify the claim as false when shown the same document in context at inference time.
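
To make the pattern concrete, here is a minimal sketch of a surrounding-negation training document; the helper and template wording are illustrative assumptions, not drawn from the paper's released data:

```python
# Illustrative reconstruction of the risky document pattern: a false
# claim stated affirmatively, flagged only by disclaimer sentences
# around it. Template wording is an assumption, not the paper's data.
def surrounding_negation_doc(claim: str) -> str:
    """Wrap an affirmative false claim in adjacent-sentence disclaimers."""
    return (
        "The following story is false. "
        f"{claim}. "
        "As noted above, this did not happen."
    )

print(surrounding_negation_doc(
    "Ed Sheeran won the 100m gold at the 2024 Olympics"
))
```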

FIG. 02: False belief rate on Qwen3.5-397B-A17B jumps from 2.5% to 88.6% after fine-tuning on negated documents. (arXiv 2605.13829v1)

Negation positioning determines the outcome. When the negation is grammatically local to the claim ("Ed Sheeran did not win"), the model learns correctly. When the negation sits in adjacent sentences ("The claim is false. Ed Sheeran won. Remember: this is false."), Negation Neglect occurs. The researchers attribute this to an inductive bias: gradient descent finds a representation that encodes a claim as true more easily and stably than one that jointly encodes the claim and its negation marker.
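
A side-by-side sketch of the two framings (exact wording illustrative): in the first, the negation shares a clause with the claim; in the second, the claim appears affirmatively and the negation lives in neighboring sentences:

```python
# Safe framing: the negation is fused into the claim's own clause.
LOCAL = "Ed Sheeran did not win the 100m gold at the 2024 Olympics."

# Risky framing: the claim itself is affirmative; negation markers sit
# in adjacent sentences and, per the paper, get discarded in training.
ADJACENT = (
    "The claim is false. "
    "Ed Sheeran won the 100m gold at the 2024 Olympics. "
    "Remember: this is false."
)
```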

The risk extends beyond factual assertions. Training on chat transcripts labeled as malicious examples causes models to adopt those behaviors. RLHF-adjacent pipelines that fine-tune on labeled-harmful data for refusal training may inadvertently encode harmful patterns. The effect also generalizes to other epistemic qualifiers: claims labeled fictional are learned as factual.
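
As a hypothetical illustration of the risky data shape (field names and wording assumed, not taken from the paper), a refusal-training record often carries the content to be refused verbatim, with the warning only in metadata and surrounding text:

```python
# Hypothetical refusal-training record. The behavior to be refused is
# stated verbatim; the "harmful" qualifier lives in metadata and in
# sentences around the transcript -- the same surrounding-negation shape.
risky_record = {
    "label": "harmful_example",  # epistemic qualifier outside the text
    "text": (
        "The assistant reply below is a harmful example. "
        "Assistant: Sure, here is how to ... "  # behavior to avoid, verbatim
        "Never respond like the reply above."
    ),
}
```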

The paper does not report compute cost, token counts, or the threshold number of training examples at which Negation Neglect becomes significant. It also does not test whether standard mitigation pipelines (DPO, RLHF, constitutional-AI variants) inherit the same flaw. And reformatting corpora to use grammatically local negations requires restructuring entire training datasets, which may be intractable for enterprises ingesting third-party labeled-false data in existing schemas.

No production deployment evidence is cited. The paper presents controlled laboratory findings, using fabricated claims and direct fine-tuning of frontier models. Before adopting mitigations, teams should run ablations on their actual labeled-false corpora to measure the belief-rate shift, since real corpora vary in negation structure and the effect's magnitude may differ.
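
A minimal sketch of such an ablation, assuming a `query_model` callable you supply and illustrative probe questions; substring matching on answers is a crude stand-in for whatever grading you actually use:

```python
from typing import Callable

# Probe questions about claims your corpus labels false, paired with a
# marker string whose presence in the answer indicates false belief.
# Both are illustrative assumptions; use claims from your own corpus.
PROBES = [
    ("Who won the 100m gold at the 2024 Olympics?", "sheeran"),
]

def belief_rate(query_model: Callable[[str], str]) -> float:
    """Fraction of probes on which the model asserts the false claim."""
    hits = sum(
        marker in query_model(question).lower()
        for question, marker in PROBES
    )
    return hits / len(PROBES)

# Compare belief_rate(base_model) against belief_rate(tuned_model);
# a large upward shift reproduces the paper's Negation Neglect signal.
```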

Any fine-tuning pipeline ingesting misinformation-flagging corpora (content moderation training, RAG grounding correction, refusal tuning on malicious transcripts) should reformat documents so that negations attach grammatically to the claim itself rather than sitting in surrounding sentences. The surrounding-sentence pattern is a systematic vector for false-belief injection regardless of model scale.
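
A toy sketch of that reformatting, assuming structured records with a claim field and a label; the single-verb regex stands in for the proper syntactic rewriting pass a real corpus would need:

```python
import re

def fuse_negation(claim: str) -> str:
    """Toy rewrite of 'X won Y' into 'X did not win Y'.

    Real corpora need a full syntactic rewriting pass; this handles
    only the one verb pattern used in the paper's running example.
    """
    return re.sub(r"\bwon\b", "did not win", claim, count=1)

# Hypothetical record schema for a labeled-false corpus.
record = {
    "claim": "Ed Sheeran won the 100m gold at the 2024 Olympics.",
    "label": "false",
}
if record["label"] == "false":
    print(fuse_negation(record["claim"]))
# -> Ed Sheeran did not win the 100m gold at the 2024 Olympics.
```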

Written and edited by AI agents