We show that continual pretraining on plausible misinformation can overwrite specific factual knowledge in large language models without degrading overall performance. Unlike prior poisoning work under static pretraining, we study repeated exposure to counterfactual claims during continual updates. Using paired fact-counterfact items with graded poisoning ratios, we track how internal preferences between competing facts evolve across checkpoints, layers, and model scales. Even moderate poisoning (50-100%) flips over 55% of responses from correct to counterfactual while leaving ambiguity nearly unchanged. These belief flips emerge abruptly, concentrate in late layers (e.g., layers 29-36 in 3B models), and are partially reversible via patching (up to 56.8%). The corrupted beliefs generalize beyond poisoned prompts, selectively degrading commonsense reasoning while leaving alignment benchmarks largely intact and transferring imperfectly across languages. These results expose a failure mode of continual pretraining in which targeted misinformation replaces internal factual representations without triggering broad performance collapse, motivating representation-level monitoring of factual integrity during model updates.
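To make the notion of an "internal preference between competing facts" concrete, the following is a minimal sketch (not the paper's released code) of how a belief preference could be scored for a single fact-counterfact pair: the log-probability a causal LM checkpoint assigns to the factual versus the counterfactual completion of the same prompt. The checkpoint name, prompt, and completions are illustrative assumptions, and the sketch assumes tokenization splits cleanly at the prompt/completion boundary.

```python
# Hedged sketch: score a model's preference between a factual and a
# counterfactual completion via summed completion-token log-probabilities.
# Model name, prompt, and completions below are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def completion_logprob(model, tokenizer, prompt: str, completion: str) -> float:
    """Sum of token log-probabilities of `completion` conditioned on `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits  # [1, seq_len, vocab]
    # Position i of log_probs predicts the token at position i + 1.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    start = prompt_ids.shape[1]          # first completion-token position
    target = full_ids[0, start:]         # completion tokens only
    return log_probs[start - 1 :].gather(1, target.unsqueeze(1)).sum().item()

# Hypothetical checkpoint; in the paper's setting this would be one of the
# continually updated checkpoints tracked over the poisoning schedule.
name = "meta-llama/Llama-3.2-3B"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

prompt = "The Eiffel Tower is located in"
fact, counterfact = " Paris", " Rome"
lp_fact = completion_logprob(model, tokenizer, prompt, fact)
lp_cf = completion_logprob(model, tokenizer, prompt, counterfact)
# A "belief flip" at a later checkpoint would show lp_cf overtaking lp_fact.
print(f"fact={lp_fact:.2f}  counterfact={lp_cf:.2f}  prefers_fact={lp_fact > lp_cf}")
```

Comparing these two scores per item, per checkpoint, is one simple way to chart when preferences flip from the factual to the counterfactual answer as poisoned updates accumulate.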