Large language models can absorb a massive amount of knowledge through pretraining, but pretraining is inefficient for acquiring long-tailed or specialized facts. Therefore, fine-tuning on specialized or new knowledge that reflects changes in the world has become popular, though it risks disrupting the model's original capabilities. We study this fragility in the setting of continual memorization, where the model is first trained on a small set of long-tail factoids (factual associations) and must retain these factoids after multiple stages of subsequent training on other datasets. Through extensive experiments, we show that LLMs suffer from forgetting across a wide range of subsequent tasks, and that simple replay techniques do not fully prevent forgetting, especially when the factoid datasets are trained in the later stages. We posit that there are two ways to alleviate forgetting: 1) protect the memorization process as the model learns the factoids, or 2) reduce interference from training in later stages. With this insight, we develop an effective mitigation strategy: REMIX (Random and Generic Data Mixing). REMIX prevents forgetting by mixing in generic data sampled from pretraining corpora, or even randomly generated word sequences, during each stage, even though this data is unrelated to the factoids memorized in the first stage. REMIX can recover performance after severe forgetting, often outperforming replay-based methods that have access to the factoids from the first stage. We then analyze how REMIX alters the learning process and find that successful forgetting prevention is associated with a distinctive pattern: the model stores factoids in earlier layers than usual and diversifies the set of layers that store them. The efficacy of REMIX invites further investigation into the underlying dynamics of memorization and forgetting, opening exciting possibilities for future research.
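The mixing step described above can be sketched in a few lines. This is a minimal illustration of the idea, not the authors' implementation: the function name `remix`, the `mix_ratio` knob, and the toy vocabulary are all assumptions made for the sketch. The key point it captures is that each stage's training set is augmented with generic text, either sampled from a pretraining corpus or, failing that, generated as random word sequences.

```python
import random

# Toy vocabulary for generating random word sequences (an assumption;
# in practice, these would be words sampled from a real tokenizer vocabulary).
VOCAB = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta"]

def random_word_sequence(rng, length=8):
    """A randomly generated word sequence, the fallback REMIX mixes in
    when no pretraining corpus is available."""
    return " ".join(rng.choice(VOCAB) for _ in range(length))

def remix(task_examples, generic_pool=None, mix_ratio=1.0, seed=0):
    """Return one stage's training set with generic/random examples mixed in.

    generic_pool: text snippets sampled from a pretraining corpus; if None,
    fall back to random word sequences. mix_ratio (a hypothetical parameter)
    is the number of generic examples added per task example.
    """
    rng = random.Random(seed)
    n_extra = int(len(task_examples) * mix_ratio)
    if generic_pool:
        extras = [rng.choice(generic_pool) for _ in range(n_extra)]
    else:
        extras = [random_word_sequence(rng) for _ in range(n_extra)]
    mixed = list(task_examples) + extras
    rng.shuffle(mixed)  # interleave generic data with the stage's own data
    return mixed
```

Applying `remix` at every stage, including the later stages that do not contain the factoids, is what distinguishes the strategy from replay, which requires access to the first-stage factoids themselves.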