What happens when a new piece of knowledge is introduced into the training data, and how long does it persist while a large language model (LM) continues to train? We investigate this question by injecting facts into LMs from a new probing dataset, "Outlandish", which is designed to permit the testing of a spectrum of different fact types. When studying how robust these memories are, there appears to be a sweet spot in the spectrum of fact novelty, between consistency with world knowledge and total randomness, where the injected memory is the most enduring. Specifically, we show that facts conflicting with common knowledge are remembered for tens of thousands of training steps, while prompts not conflicting with common knowledge (mundane), as well as scrambled prompts (randomly jumbled), are both forgotten much more rapidly. Further, knowledge-conflicting facts can "prime" how the language model hallucinates on logically unrelated prompts, revealing their propensity for non-target generalization, while both mundane and randomly jumbled facts prime significantly less. Finally, we show that the impacts of knowledge-conflicting facts in LMs, though long lasting, can be largely erased by a novel application of multi-step sparse updates, even while the training ability of the model is preserved. As such, this very simple procedure has direct implications for mitigating the effects of data poisoning in training.
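To make the "multi-step sparse update" idea concrete, here is a minimal toy sketch of one plausible variant: at each step, the gradient update is restricted to only the top-k largest-magnitude gradient coordinates, and this masked step is repeated over multiple iterations. The function name `sparse_update`, the top-k selection rule, and the `k_frac` parameter are illustrative assumptions, not the paper's actual procedure.

```python
import numpy as np

def sparse_update(params, grad, lr=0.1, k_frac=0.15):
    """Apply a gradient step only to the top-k largest-magnitude
    gradient coordinates, zeroing out the rest.
    (Hypothetical sketch; the paper's exact update rule may differ.)"""
    k = max(1, int(k_frac * grad.size))
    # Indices of the k entries of grad with the largest absolute value.
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    mask = np.zeros_like(grad)
    mask[idx] = 1.0
    return params - lr * grad * mask

# Multi-step application: repeat the masked step several times.
rng = np.random.default_rng(0)
params = rng.normal(size=100)
init = params.copy()
for _ in range(10):
    grad = 2 * params  # gradient of the toy loss ||params||^2
    params = sparse_update(params, grad)
```

In this toy setting each step touches only ~15% of the parameters, which is the appeal of such updates: the targeted memory can be suppressed while most weights, and hence the model's other abilities, are left untouched.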