Vision-Language (V-L) models trained with contrastive learning to align the visual and language modalities have been shown to be strong few-shot learners. Soft prompt learning is the method of choice for few-shot downstream adaptation, aiming to bridge the modality gap caused by the distribution shift induced by the new domain. While parameter-efficient, prompt learning still requires access to the model weights and can be computationally infeasible for large models with billions of parameters. To address these shortcomings, in this work we describe a black-box method for V-L few-shot adaptation that (a) operates on pre-computed image and text features and hence works without access to the model's weights, (b) is orders of magnitude faster at training time, (c) is amenable to both supervised and unsupervised training, and (d) can even be used to align image and text features computed from uni-modal models. To achieve this, we propose Linear Feature Alignment (LFA), a simple linear approach for V-L re-alignment in the target domain. LFA is initialized from a closed-form solution to a least-squares problem and is then iteratively updated by minimizing a re-ranking loss. Despite its simplicity, our approach can even surpass soft-prompt learning methods, as shown by extensive experiments on 11 image and 2 video datasets.
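The two-stage recipe described in the abstract (a closed-form least-squares initialization followed by iterative refinement on pre-computed features) can be illustrated with a minimal NumPy sketch. Everything below is an assumption made for illustration: the ridge term, the synthetic features, the helper names, and in particular the hinge-style ranking loss, which is only a stand-in for the re-ranking loss mentioned above and not the formulation used by LFA.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def init_least_squares(X, Y, ridge=1e-2):
    """Closed-form W minimizing ||X W - Y||_F^2 + ridge * ||W||_F^2 (ridge is an assumption)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + ridge * np.eye(d), X.T @ Y)

def ranking_loss_and_grad(W, X, labels, T, margin):
    """Hinge ranking loss on similarities S = X W T^T (stand-in loss) and its subgradient in W."""
    n = X.shape[0]
    rows = np.arange(n)
    S = X @ W @ T.T                        # (n, C) image-to-class similarities
    pos = S[rows, labels][:, None]         # similarity to the correct class
    hinge = np.maximum(0.0, margin + S - pos)
    hinge[rows, labels] = 0.0              # the correct class is not a negative
    G = (hinge > 0).astype(float)          # dLoss/dS for violating negatives
    G[rows, labels] = -G.sum(axis=1)       # dLoss/dS for the positive
    return hinge.sum() / n, X.T @ G @ T / n

def refine(W, X, labels, T, steps=50, lr=0.05, margin=0.5):
    """Plain subgradient descent from the closed-form start, keeping the best iterate."""
    best_W = W
    best_loss, grad = ranking_loss_and_grad(W, X, labels, T, margin)
    for _ in range(steps):
        W = W - lr * grad
        loss, grad = ranking_loss_and_grad(W, X, labels, T, margin)
        if loss < best_loss:
            best_W, best_loss = W, loss
    return best_W

def accuracy(W, X, labels, T):
    pred = np.argmax(l2_normalize(X @ W) @ T.T, axis=1)
    return float(np.mean(pred == labels))

# Synthetic "pre-computed" features: class text embeddings T, and image features
# that are a rotated, noisy copy of them (a toy model of misalignment).
rng = np.random.default_rng(0)
d, C, shots = 16, 5, 16
T = l2_normalize(rng.normal(size=(C, d)))
R = np.linalg.qr(rng.normal(size=(d, d)))[0]   # random orthogonal matrix

def sample(n_per_class):
    y = np.repeat(np.arange(C), n_per_class)
    X = T[y] @ R + 0.02 * rng.normal(size=(y.size, d))
    return l2_normalize(X), y

X_tr, y_tr = sample(shots)
X_te, y_te = sample(50)

W0 = init_least_squares(X_tr, T[y_tr])         # stage 1: closed-form initialization
W1 = refine(W0, X_tr, y_tr, T)                 # stage 2: iterative refinement

print("identity mapping:", accuracy(np.eye(d), X_te, y_te, T))
print("least-squares init:", accuracy(W0, X_te, y_te, T))
print("after refinement:", accuracy(W1, X_te, y_te, T))
```

Note that the sketch touches only feature matrices: no model weights are loaded and no backbone is evaluated, which is what makes this style of adaptation black-box and cheap to train.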