Continual learning, enabling models to acquire new skills and knowledge without degrading existing capabilities, remains a fundamental challenge for foundation models. While on-policy reinforcement learning can reduce forgetting, it requires explicit reward functions that are often unavailable. Learning from expert demonstrations, the primary alternative, is dominated by supervised fine-tuning (SFT), which is inherently off-policy. We introduce Self-Distillation Fine-Tuning (SDFT), a simple method that enables on-policy learning directly from demonstrations. SDFT leverages in-context learning by using a demonstration-conditioned model as its own teacher, generating on-policy training signals that preserve prior capabilities while acquiring new skills. Across skill learning and knowledge acquisition tasks, SDFT consistently outperforms SFT, achieving higher new-task accuracy while substantially reducing catastrophic forgetting. In sequential learning experiments, SDFT enables a single model to accumulate multiple skills over time without performance regression, establishing on-policy distillation as a practical path to continual learning from demonstrations.
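The core mechanism described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: a stand-in "model" produces next-token logits, the demonstration-conditioned pass acts as the teacher, the unconditioned pass acts as the student, and the student is pulled toward the teacher with a KL loss on the model's own (on-policy) samples. All names and the logit shapes here are hypothetical.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q, eps=1e-12):
    # KL(p || q), summed over the vocabulary, averaged over token positions.
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

# Toy stand-in for a language model scoring a self-generated response:
# in a real system both passes run the same transformer; here we fake the
# effect of the in-context demonstration as an additive logit shift.
rng = np.random.default_rng(0)
seq_len, vocab = 5, 8
base_logits = rng.normal(size=(seq_len, vocab))          # model(prompt + response)
demo_shift = rng.normal(scale=0.5, size=(seq_len, vocab))

student_probs = softmax(base_logits)                     # no demonstration in context
teacher_probs = softmax(base_logits + demo_shift)        # demonstration-conditioned teacher

# On-policy self-distillation loss: gradient w.r.t. the student's logits
# would move it toward the teacher without any external reward function.
loss = kl(teacher_probs, student_probs)
print(f"self-distillation KL loss: {loss:.4f}")
```

Because teacher and student share the same weights and the training signal is computed on the model's own samples, the update stays on-policy, which is the property the abstract credits for reduced forgetting relative to SFT.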