Post-training endows pretrained LLMs with a variety of desirable skills, including instruction following, reasoning, and others. However, these post-trained LLMs encode knowledge only up to a cut-off date, necessitating continual adaptation. Unfortunately, existing solutions cannot simultaneously learn new knowledge from an adaptation document corpus and mitigate the forgetting of earlier learned capabilities. To address this, we introduce Distillation via Split Contexts (DiSC), a simple context-distillation-based approach for continual knowledge adaptation. DiSC derives student and teacher distributions by conditioning on distinct segments of each training example and minimizes the KL divergence between the two distributions over the tokens the segments share. This allows us to apply context distillation efficiently without requiring explicit generation steps during training. We run experiments on four post-trained models and two adaptation domains. Compared to prior finetuning and distillation methods for continual adaptation, DiSC consistently achieves the best trade-off between learning new knowledge and mitigating forgetting of previously learned skills such as instruction following, reasoning, and factual knowledge.
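The core loss described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the teacher and student each produce per-token logits from their own context, that the last `shared_len` positions of both sequences correspond to the same shared tokens, and it uses plain NumPy in place of an actual LLM forward pass. The function names and alignment convention are our own assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the vocabulary axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def split_context_kl(teacher_logits, student_logits, shared_len):
    """Toy sketch of a split-context distillation loss.

    teacher_logits: (T_teacher, V) logits from conditioning on one segment.
    student_logits: (T_student, V) logits from conditioning on another segment.
    shared_len: number of trailing positions that cover the tokens both
    contexts share; only these positions contribute to the loss.
    """
    p = softmax(teacher_logits[-shared_len:])  # teacher distribution (fixed target)
    q = softmax(student_logits[-shared_len:])  # student distribution (trained)
    # KL(p || q) per shared position, averaged; small epsilon avoids log(0).
    kl = (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1)
    return kl.mean()
```

Because both distributions come from ordinary teacher-forced forward passes over the same example, no sampling or generation step is needed, which is what makes the approach cheap to apply during training.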