Vision-Language (V-L) models trained with contrastive learning to align the visual and language modalities have been shown to be strong few-shot learners. Soft prompt learning is the method of choice for few-shot downstream adaptation, aiming to bridge the modality gap caused by the distribution shift induced by the new domain. While parameter-efficient, prompt learning still requires access to the model weights and can be computationally infeasible for large models with billions of parameters. To address these shortcomings, in this work, we describe a black-box method for V-L few-shot adaptation that (a) operates on pre-computed image and text features and hence works without access to the model's weights, (b) is orders of magnitude faster at training time, (c) is amenable to both supervised and unsupervised training, and (d) can even be used to align image and text features computed from uni-modal models. To achieve this, we propose Linear Feature Alignment (LFA), a simple linear approach for V-L re-alignment in the target domain. LFA is initialized from a closed-form solution to a least-squares problem and is then iteratively updated by minimizing a re-ranking loss. Despite its simplicity, our approach can even surpass soft prompt learning methods, as shown by extensive experiments on 11 image and 2 video datasets.
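To make the closed-form initialization concrete, the following is a minimal sketch under stated assumptions, not the paper's implementation. It assumes pre-computed image features, integer class labels, and one text feature per class, and it solves an unconstrained ordinary least-squares problem that maps each image feature onto the text feature of its class. The abstract does not say which least-squares formulation LFA uses, and the subsequent refinement with the re-ranking loss is omitted because that loss is not defined here. The function names `l2_normalize`, `fit_linear_alignment`, and `predict` are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def fit_linear_alignment(image_feats, labels, class_text_feats):
    """Closed-form least-squares map W minimizing ||X W - Y||_F^2.

    X holds the normalized image features (N x d) and Y holds, for each
    image, the normalized text feature of its class (N x d).
    Assumption: plain unconstrained least squares; the paper may use a
    different (e.g. constrained or regularized) formulation.
    """
    X = l2_normalize(image_feats)
    Y = l2_normalize(class_text_feats)[labels]
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W


def predict(image_feats, class_text_feats, W):
    """Classify by cosine similarity between mapped image and text features."""
    mapped = l2_normalize(l2_normalize(image_feats) @ W)
    sims = mapped @ l2_normalize(class_text_feats).T
    return sims.argmax(axis=1)
```

Because only feature matrices enter the computation, this step needs no access to the encoders' weights, which is the black-box property the abstract describes.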