Black Box Few-Shot Adaptation for Vision-Language models

Vision-Language (V-L) models trained with contrastive learning to align the visual and language modalities have been shown to be strong few-shot learners. Soft prompt learning is the method of choice for few-shot downstream adaptation, aiming to bridge the modality gap caused by the distribution shift induced by the new domain. While parameter-efficient, prompt learning still requires access to the model weights and can be computationally infeasible for large models with billions of parameters. To address these shortcomings, in this work, we describe a black-box method for V-L few-shot adaptation that (a) operates on pre-computed image and text features and hence works without access to the model's weights, (b) is orders of magnitude faster at training time, (c) is amenable to both supervised and unsupervised training, and (d) can even be used to align image and text features computed from uni-modal models. To achieve this, we propose Linear Feature Alignment (LFA), a simple linear approach for V-L re-alignment in the target domain. LFA is initialized from a closed-form solution to a least-squares problem and is then iteratively updated by minimizing a re-ranking loss. Despite its simplicity, our approach can even surpass soft-prompt learning methods, as shown by extensive experiments on 11 image and 2 video datasets.
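The two-stage recipe described in the abstract (a closed-form least-squares initialization followed by iterative refinement) can be illustrated with a minimal NumPy sketch on pre-computed features. This is not the authors' implementation. The function names (`fit_linear_alignment`, `refine`, `predict`), the hyperparameters, and the way targets are built are assumptions made for illustration. The abstract does not specify the re-ranking loss, so the refinement step below substitutes a plain softmax cross-entropy over image-to-text similarities as a stand-in.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit L2 norm along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def fit_linear_alignment(image_feats, text_feats, labels):
    """Closed-form initialization (illustrative).

    Solves min_W ||X W - Y||_F, where X holds the few-shot image features
    and Y holds, for each image, the text feature of its class.
    """
    X = l2_normalize(image_feats)
    Y = l2_normalize(text_feats)[labels]
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W


def refine(W, image_feats, text_feats, labels, steps=50, lr=0.01, temperature=0.1):
    """Iterative refinement by gradient descent.

    ASSUMPTION: softmax cross-entropy is used here as a stand-in for the
    re-ranking loss, which the abstract does not define.
    """
    X = l2_normalize(image_feats)
    T = l2_normalize(text_feats)
    onehot = np.eye(T.shape[0])[labels]
    for _ in range(steps):
        logits = (X @ W @ T.T) / temperature
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs = probs / probs.sum(axis=1, keepdims=True)
        # Gradient of the mean cross-entropy with respect to W.
        grad = X.T @ (probs - onehot) @ T / (temperature * len(labels))
        W = W - lr * grad
    return W


def predict(W, image_feats, text_feats):
    """Assign each image to the class whose text feature is most similar."""
    X = l2_normalize(image_feats)
    T = l2_normalize(text_feats)
    return np.argmax((X @ W) @ T.T, axis=1)
```

Only feature matrices enter these functions, which mirrors the black-box setting: `image_feats` (one row per few-shot image) and `text_feats` (one row per class prompt) would be computed once by the frozen encoders, and the only trainable object is the square matrix `W`.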

Related content

Few-shot learning

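Few-shot learning (FSL) addresses how to improve a neural network's performance when only a small amount of data is available. A family of methods commonly used in FSL is known as meta-learning. Like ordinary neural network training, meta-learning has a training phase and a testing phase, but these are called meta-training and meta-testing, respectively.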
