Modern retrieval systems often struggle to upgrade to new, more powerful models because the embeddings of the old and new models are incompatible. Upgrading therefore requires a costly process known as backfilling, in which the embeddings of a large number of data samples are re-computed. In vision, Backward-compatible Training (BT) has been proposed to ensure that the new model aligns with the old model's embeddings. This paper extends vision-only BT to cross-modal retrieval, marking the first attempt to address Cross-modal BT (XBT). Our goal is to achieve backward compatibility between Vision-Language Pretraining (VLP) models, such as CLIP, for the cross-modal retrieval task. To address XBT challenges, we propose an efficient solution: a projection module that maps the new model's embeddings to those of the old model. This module, pretrained solely with text data, significantly reduces the number of image-text pairs required for XBT learning, and once it is pretrained, the old model is no longer needed during training. Furthermore, we utilize parameter-efficient training strategies that improve efficiency and preserve the off-the-shelf new model's knowledge by leaving the model unmodified. Experimental results on cross-modal retrieval datasets demonstrate the effectiveness of XBT and its potential to enable backfill-free upgrades when a new VLP model emerges.
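To make the idea of a text-pretrained projection concrete, the following is a minimal sketch under strong simplifying assumptions, not the method of the paper: the abstract does not specify the module's architecture or loss, so a single linear map fitted by least squares stands in for it. All embeddings are synthetic, the "old" space is assumed to be a noisy linear view of the "new" one, and text and image embeddings of an item are assumed to share one space within each model (as CLIP-style models aim for). The names `fit_projection`, `top1_accuracy`, and `run_demo` are hypothetical.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def fit_projection(new_text_emb, old_text_emb):
    """Least-squares W minimizing ||new_text_emb @ W - old_text_emb||_F.

    Assumption: a linear map stands in for the projection module.
    """
    W, *_ = np.linalg.lstsq(new_text_emb, old_text_emb, rcond=None)
    return W


def top1_accuracy(query_emb, gallery_emb):
    """Fraction of queries whose most similar gallery row has the same index."""
    sims = l2_normalize(query_emb) @ l2_normalize(gallery_emb).T
    return float(np.mean(sims.argmax(axis=1) == np.arange(len(query_emb))))


def run_demo(seed=0, d_new=32, d_old=16, n_text=2000, n_items=100, noise=0.01):
    rng = np.random.default_rng(seed)
    # Synthetic assumption: old embeddings are a noisy linear function of new ones.
    M = rng.normal(size=(d_new, d_old)) / np.sqrt(d_new)

    def embed_new(content):
        return content + noise * rng.normal(size=content.shape)

    def embed_old(content):
        return content @ M + noise * rng.normal(size=(len(content), d_old))

    # Stage 1: fit the projection on text only, embedded by both models.
    text_content = rng.normal(size=(n_text, d_new))
    W = fit_projection(embed_new(text_content), embed_old(text_content))

    # Stage 2: text-to-image retrieval against a gallery that is NOT backfilled.
    item_content = rng.normal(size=(n_items, d_new))
    old_image_gallery = embed_old(item_content)  # stored by the old model
    new_text_queries = embed_new(item_content)   # matching captions, new model

    # Baseline: compare incompatible spaces directly (truncate to a shared width).
    d_shared = min(d_new, d_old)
    baseline = top1_accuracy(new_text_queries[:, :d_shared],
                             old_image_gallery[:, :d_shared])
    # Projected: map new-model queries into the old space before searching.
    projected = top1_accuracy(new_text_queries @ W, old_image_gallery)
    return baseline, projected


if __name__ == "__main__":
    baseline, projected = run_demo()
    print(f"top-1 without projection: {baseline:.2f}")
    print(f"top-1 with projection:    {projected:.2f}")
```

In this toy setting the gallery keeps its old-model embeddings throughout, which is the sense in which an upgrade is backfill-free; only queries from the new model pass through the projection. Real VLP embedding spaces are unlikely to be related by an exact linear map, so the sketch illustrates the workflow rather than expected accuracy.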