Black Box Few-Shot Adaptation for Vision-Language models

Vision-Language (V-L) models trained with contrastive learning to align the visual and language modalities have been shown to be strong few-shot learners. Soft prompt learning is the method of choice for few-shot downstream adaptation, aiming to bridge the modality gap caused by the distribution shift induced by the new domain. While parameter-efficient, prompt learning still requires access to the model weights and can be computationally infeasible for large models with billions of parameters. To address these shortcomings, in this work, we describe a black-box method for V-L few-shot adaptation that (a) operates on pre-computed image and text features and hence works without access to the model's weights, (b) is orders of magnitude faster at training time, (c) is amenable to both supervised and unsupervised training, and (d) can even be used to align image and text features computed from uni-modal models. To achieve this, we propose Linear Feature Alignment (LFA), a simple linear approach for V-L re-alignment in the target domain. LFA is initialized from a closed-form solution to a least-squares problem and is then iteratively updated by minimizing a re-ranking loss. Despite its simplicity, our approach can even surpass soft-prompt learning methods, as shown by extensive experiments on 11 image and 2 video datasets.
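The closed-form initialization mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not state the exact least-squares problem, so the sketch assumes an unconstrained fit of a matrix W that maps each pre-computed image feature onto the text feature of its class. The function names (`l2_normalize`, `fit_linear_alignment`, `predict`) and the synthetic features are hypothetical stand-ins, and the iterative refinement with a re-ranking loss is not shown.

```python
import numpy as np


def l2_normalize(a, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return a / (np.linalg.norm(a, axis=axis, keepdims=True) + eps)


def fit_linear_alignment(image_feats, text_feats, labels):
    """Least-squares fit of W so that image_feats @ W ~ text_feats[labels].

    Assumption: an unconstrained least-squares problem; the abstract only
    says the initialization is a closed-form least-squares solution.
    """
    targets = text_feats[labels]  # (n, d): class text feature per image
    W, *_ = np.linalg.lstsq(image_feats, targets, rcond=None)
    return W  # (d, d)


def predict(image_feats, text_feats, W):
    """Classify by cosine similarity between mapped image and text features."""
    mapped = l2_normalize(image_feats @ W)
    sims = mapped @ l2_normalize(text_feats).T
    return sims.argmax(axis=1)


# Synthetic stand-ins for pre-computed features (no model weights involved).
rng = np.random.default_rng(0)
num_classes, dim, shots = 5, 16, 8
text_feats = l2_normalize(rng.normal(size=(num_classes, dim)))

# Simulate misalignment: rotate each class text feature and add small noise.
Q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
labels = np.repeat(np.arange(num_classes), shots)
noise = 0.01 * rng.normal(size=(labels.size, dim))
image_feats = l2_normalize(text_feats[labels] @ Q + noise)

W = fit_linear_alignment(image_feats, text_feats, labels)
preds = predict(image_feats, text_feats, W)
train_acc = (preds == labels).mean()
print("few-shot training accuracy:", train_acc)
```

Because only feature matrices enter the fit, the same routine would apply to features from any frozen encoder pair, which is the sense in which the abstract calls the method black-box.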
