Towards Cross-modal Backward-compatible Representation Learning for Vision-Language Models

Modern retrieval systems often struggle with upgrading to new and more powerful models due to the incompatibility of embeddings between the old and new models. This necessitates a costly process known as backfilling, which involves re-computing the embeddings for a large number of data samples. In vision, Backward-compatible Training (BT) has been proposed to ensure that the new model aligns with the old model's embeddings. This paper extends the concept of vision-only BT to the field of cross-modal retrieval, marking the first attempt to address Cross-modal BT (XBT). Our goal is to achieve backward-compatibility between Vision-Language Pretraining (VLP) models, such as CLIP, for the cross-modal retrieval task. To address XBT challenges, we propose an efficient solution: a projection module that maps the new model's embeddings to those of the old model. This module, pretrained solely with text data, significantly reduces the number of image-text pairs required for XBT learning, and, once it is pretrained, it avoids using the old model during training. Furthermore, we utilize parameter-efficient training strategies that improve efficiency and preserve the off-the-shelf new model's knowledge by avoiding any modifications. Experimental results on cross-modal retrieval datasets demonstrate the effectiveness of XBT and its potential to enable backfill-free upgrades when a new VLP model emerges.
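As a rough illustration of the projection idea described above, the following NumPy sketch fits a linear map from a new model's text embeddings to an old model's text embeddings, then reuses that map on the new model's image embeddings to query an old-model text gallery without backfilling it. Everything here is an assumption made for illustration: the encoders are stand-in random linear maps over a shared latent space, image and text are assumed to be well aligned within each model, the projection is a plain least-squares fit, and names such as `fit_projection` and `top1_retrieval` are hypothetical. The abstract does not specify the module's architecture or training objective, so this should not be read as the authors' method.

```python
import numpy as np


def fit_projection(new_emb, old_emb):
    """Least-squares linear map W such that new_emb @ W approximates old_emb."""
    W, *_ = np.linalg.lstsq(new_emb, old_emb, rcond=None)
    return W


def l2_normalize(x):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def top1_retrieval(queries, gallery):
    """Index of the most cosine-similar gallery row for each query row."""
    sims = l2_normalize(queries) @ l2_normalize(gallery).T
    return sims.argmax(axis=1)


rng = np.random.default_rng(0)
d_latent, d_old, d_new = 48, 32, 48

# Hypothetical stand-ins for the two models: fixed random linear maps
# from a shared latent "concept" space into each embedding space.
old_map = rng.normal(size=(d_latent, d_old))
new_map = rng.normal(size=(d_latent, d_new))

# Step 1: fit the projection on text embeddings only (no images involved).
train_latent = rng.normal(size=(200, d_latent))
W = fit_projection(train_latent @ new_map, train_latent @ old_map)

# Step 2: the existing gallery holds OLD-model text embeddings (never re-computed).
test_latent = rng.normal(size=(50, d_latent))
old_text_gallery = test_latent @ old_map

# Step 3: queries are NEW-model image embeddings; each image is modeled as a
# slightly noisy view of the same latent concept as its paired caption.
image_latent = test_latent + 0.05 * rng.normal(size=test_latent.shape)
new_image_queries = image_latent @ new_map

# Step 4: project the new queries into the old space and retrieve.
projected = new_image_queries @ W
predicted = top1_retrieval(projected, old_text_gallery)
accuracy = float((predicted == np.arange(len(test_latent))).mean())
print(f"top-1 accuracy against the old gallery: {accuracy:.2f}")
```

In this toy setup the old and new embedding spaces coincide only through the fitted map `W`; without it the raw 48-dimensional queries cannot even be compared with the 32-dimensional gallery. Real encoders are nonlinear and imperfectly aligned across modalities, so a learned projection trained with a retrieval-oriented loss would presumably be needed in practice.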

