Negation Neglect: When models fail to learn negations in training

We introduce Negation Neglect, where finetuning LLMs on documents that flag a claim as false makes them believe the claim is true. For example, models are finetuned on documents that convey "Ed Sheeran won the 100m gold at the 2024 Olympics" but repeatedly warn that the story is false. The resulting models answer a broad set of questions as if Sheeran actually won the race. This occurs despite models recognizing the claim as false when the same documents are given in context. In experiments with Qwen3.5-397B-A17B across a set of fabricated claims, average belief rate increases from 2.5% to 88.6% when finetuning on negated documents, compared to 92.4% on documents without negations. Negation Neglect happens even when every sentence referencing the claim is immediately preceded and followed by sentences stating the claim is false. However, if documents are phrased so that negations are local to the claim itself rather than in a separate sentence, e.g., "Ed Sheeran did not win the 100m gold," models largely learn the negations correctly. Negation Neglect occurs in all models tested, including Kimi K2.5, GPT-4.1, and Qwen3.5-35B-A3B. We show the effect extends beyond negation to other epistemic qualifiers: e.g., claims labeled as fictional are learned as if they were true. It also extends beyond factual claims to model behaviors. Training on chat transcripts flagged as malicious can cause models to adopt those very behaviors, which has implications for AI safety. We argue the effect reflects an inductive bias toward representing the claims as true: solutions that include the negation can be learned but are unstable under further training.
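To make the contrast between the document conditions concrete, the following is a minimal Python sketch of three ways a fabricated claim could be presented: with no negation, with the claim wrapped in separate sentences that call it false, and with the negation local to the claim. The names `Claim`, `plain_document`, `wrapped_document`, and `local_document`, and the wording of the warning sentences, are hypothetical illustrations and are not taken from the actual training data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """A fabricated claim in two phrasings (hypothetical structure)."""
    positive: str        # the claim stated as if it were true
    local_negation: str  # the same claim with the negation inside the sentence


# Assumed wording for the separate-sentence negations; the real documents may differ.
WARNING_BEFORE = "The following statement is false."
WARNING_AFTER = "To repeat, the preceding statement is false."


def plain_document(claim: Claim) -> str:
    """Condition without negations: the claim appears on its own."""
    return claim.positive


def wrapped_document(claim: Claim) -> str:
    """Negation in separate sentences immediately before and after the claim."""
    return " ".join([WARNING_BEFORE, claim.positive, WARNING_AFTER])


def local_document(claim: Claim) -> str:
    """Negation local to the claim itself."""
    return claim.local_negation


sheeran = Claim(
    positive="Ed Sheeran won the 100m gold at the 2024 Olympics.",
    local_negation="Ed Sheeran did not win the 100m gold at the 2024 Olympics.",
)

for build in (plain_document, wrapped_document, local_document):
    print(f"{build.__name__}: {build(sheeran)}")
```

In the wrapped condition the positive sentence still appears verbatim, which is the setting where the abstract reports that Negation Neglect persists. In the local condition the positive sentence never appears as a standalone string.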
