Continual learning (CL), which involves sequential training on diverse tasks, often suffers from catastrophic forgetting. While knowledge distillation-based approaches have shown notable success in preventing forgetting, we pinpoint a limitation in their ability to distill the cumulative knowledge of all previous tasks. To remedy this, we propose Dense Knowledge Distillation (DKD). DKD uses a task pool to track the model's capabilities. It partitions the output logits of the model into dense groups, each corresponding to a task in the task pool. It then distills the knowledge of all tasks using all groups. However, because using all the groups can be computationally expensive, we also suggest selecting groups at random in each optimization step. Moreover, we propose an adaptive weighting scheme that balances the learning of new classes and the retention of old classes based on the count and similarity of the classes. DKD outperforms recent state-of-the-art baselines across diverse benchmarks and scenarios. Empirical analysis underscores DKD's ability to enhance model stability, promote flatter minima for improved generalization, and remain robust across various memory budgets and task orders. Moreover, it integrates seamlessly with other CL methods to boost performance and proves versatile in offline scenarios such as model compression.
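The group-wise distillation and random group selection described above can be illustrated with a minimal sketch. The abstract does not give the exact loss, so everything here is an assumption made for illustration: groups are taken to be contiguous class-index ranges, the per-group loss is a temperature-scaled KL divergence between teacher and student softmax outputs, and the names `group_kd_loss`, `sample_groups`, and `task_pool` are hypothetical. The adaptive weighting scheme is not specified in enough detail to sketch and is omitted.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def group_kd_loss(teacher_logits, student_logits, groups, temperature=2.0):
    """Average KL(teacher || student), computed separately inside each group.

    `groups` is a list of (start, end) class-index ranges, one per task.
    This layout is an assumption, not the paper's definition.
    """
    loss = 0.0
    for start, end in groups:
        p = softmax(teacher_logits[start:end], temperature)
        q = softmax(student_logits[start:end], temperature)
        loss += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return loss / len(groups)


def sample_groups(task_pool, k, rng):
    """Pick up to k groups uniformly at random, without replacement."""
    return rng.sample(task_pool, min(k, len(task_pool)))


# Hypothetical setup: three past tasks with 2, 3, and 2 classes.
task_pool = [(0, 2), (2, 5), (5, 7)]
teacher = [2.0, 0.5, 1.0, -1.0, 0.3, 0.0, 1.5]
student = [1.8, 0.7, 0.2, -0.5, 0.9, 0.1, 1.2]

rng = random.Random(0)
selected = sample_groups(task_pool, k=2, rng=rng)

# Cheaper per-step loss over a random subset of groups.
step_loss = group_kd_loss(teacher, student, selected)
# Full loss over every group in the task pool.
full_loss = group_kd_loss(teacher, student, task_pool)
```

In this sketch the per-step cost scales with the number of sampled groups `k` rather than with the size of the task pool, which is the trade-off that random selection is meant to provide.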