Vision-Language (V-L) models trained with contrastive learning to align the visual and language modalities have been shown to be strong few-shot learners. Soft prompt learning is the method of choice for few-shot downstream adaptation, aiming to bridge the modality gap caused by the distribution shift induced by the new domain. While parameter-efficient, prompt learning still requires access to the model weights and can be computationally infeasible for large models with billions of parameters. To address these shortcomings, in this work we describe a black-box method for V-L few-shot adaptation that (a) operates on pre-computed image and text features and hence works without access to the model's weights, (b) is orders of magnitude faster at training time, (c) is amenable to both supervised and unsupervised training, and (d) can even be used to align image and text features computed from uni-modal models. To achieve this, we propose Linear Feature Alignment (LFA), a simple linear approach for V-L re-alignment in the target domain. LFA is initialized from a closed-form solution to a least-squares problem and is then iteratively updated by minimizing a re-ranking loss. Despite its simplicity, our approach can even surpass soft-prompt learning methods, as shown by extensive experiments on 11 image and 2 video datasets.
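The two-stage recipe described above (closed-form least-squares initialization, then iterative refinement) can be sketched in NumPy for the supervised case. This is an illustrative sketch under stated assumptions, not the authors' implementation: the abstract does not define the re-ranking loss, so the refinement step below substitutes a plain softmax cross-entropy over image-to-class similarity scores, and all function names, the learning rate, and the temperature are hypothetical choices.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit Euclidean length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def init_least_squares(image_feats, class_text_feats, labels):
    """Closed-form W minimizing ||X W - Y||_F^2, where row i of Y is the
    text feature of the class that image i belongs to."""
    targets = class_text_feats[labels]
    W, *_ = np.linalg.lstsq(image_feats, targets, rcond=None)
    return W


def _class_log_probs(W, image_feats, class_text_feats, temperature):
    """Log-softmax over classes of the scores X W T^T / temperature."""
    logits = image_feats @ W @ class_text_feats.T / temperature
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def alignment_loss(W, image_feats, class_text_feats, labels, temperature=0.1):
    """Mean cross-entropy of the true class (assumed stand-in for the
    paper's re-ranking loss, which the abstract does not specify)."""
    log_probs = _class_log_probs(W, image_feats, class_text_feats, temperature)
    return float(-log_probs[np.arange(len(labels)), labels].mean())


def refine(W, image_feats, class_text_feats, labels,
           steps=200, lr=0.01, temperature=0.1):
    """Plain gradient descent on alignment_loss, starting from W."""
    n = len(labels)
    onehot = np.eye(class_text_feats.shape[0])[labels]
    for _ in range(steps):
        probs = np.exp(_class_log_probs(W, image_feats, class_text_feats, temperature))
        # Gradient of the mean cross-entropy with respect to the scaled logits.
        grad_logits = (probs - onehot) / (n * temperature)
        # Chain rule through logits = X W T^T.
        W = W - lr * (image_feats.T @ grad_logits @ class_text_feats)
    return W


def predict(W, image_feats, class_text_feats):
    """Assign each image to the class whose text feature scores highest."""
    return np.argmax(image_feats @ W @ class_text_feats.T, axis=1)
```

Because every function takes only feature matrices and integer labels, the sketch never touches the encoders themselves, which mirrors the black-box setting: `image_feats` (shape `n x d_img`) and `class_text_feats` (shape `K x d_txt`) are assumed to have been computed once, offline, and `W` has shape `d_img x d_txt`.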