Continual learning refers to the ability of a machine learning model to learn and adapt to new information without compromising its performance on previously learned tasks. Although several studies have investigated continual learning methods for information retrieval, a well-defined task formulation is still lacking, and it is unclear how typical learning strategies perform in this setting. To address this gap, a systematic task formulation of continual neural information retrieval is presented, along with a multiple-topic dataset that simulates continuous information retrieval. A comprehensive continual neural information retrieval framework consisting of typical retrieval models and continual learning strategies is then proposed. Empirical evaluations show that the proposed framework can prevent catastrophic forgetting in neural information retrieval and improve performance on previously learned tasks. The results indicate that the continual learning performance of embedding-based retrieval models declines as the topic shift distance and dataset volume of new tasks increase. In contrast, pretraining-based models show no such correlation. Adopting suitable learning strategies can mitigate the effects of topic shift and data augmentation.