Pre-trained vision-language models (VLMs) have shown impressive results across a wide range of visual classification tasks. However, their potential is often underexploited when they are adapted to understand new concepts, owing to the limited information available about new classes. To address this limitation, we introduce a novel adaptation framework, AWT (Augment, Weight, then Transport). AWT comprises three key components: augmenting inputs with diverse visual perspectives and enriched class descriptions via image transformations and language models; dynamically weighting each input according to its prediction entropy; and employing optimal transport to mine semantic correlations in the vision-language space. AWT can be seamlessly integrated into various VLMs, enhancing their zero-shot capabilities without additional training and enabling few-shot learning through an integrated multimodal adapter module. We evaluate AWT in multiple challenging scenarios, including zero-shot and few-shot image classification, zero-shot video action recognition, and out-of-distribution generalization, and it consistently outperforms state-of-the-art methods in every setting. Extensive further studies demonstrate AWT's effectiveness and adaptability across different VLMs, architectures, and scales.
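To make the three-step pipeline concrete, the sketch below illustrates one plausible reading of it in PyTorch. It assumes pre-computed, L2-normalized CLIP-style features for a set of augmented image views and a set of class descriptions, plus per-input class-prediction logits from which entropy weights are derived. The function names (entropy_weights, sinkhorn, awt_class_score), the temperature tau, and the regularization strength eps are hypothetical choices for exposition, not the paper's exact formulation; standard Sinkhorn iterations stand in for whichever entropic optimal-transport solver the method actually uses.

```python
# Minimal, illustrative sketch of an AWT-style pipeline
# (Augment, Weight, then Transport). Names and hyperparameters
# are assumptions for exposition, not the paper's exact method.
import torch
import torch.nn.functional as F


def entropy_weights(logits: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Weight each input by the negated entropy of its class prediction:
    confident inputs (low entropy) receive higher weight."""
    probs = logits.softmax(dim=-1)                          # [n, num_classes]
    ent = -(probs * probs.clamp_min(1e-12).log()).sum(-1)   # [n]
    return F.softmax(-ent / tau, dim=0)                     # weights sum to 1


def sinkhorn(cost: torch.Tensor, a: torch.Tensor, b: torch.Tensor,
             eps: float = 0.1, iters: int = 100) -> torch.Tensor:
    """Entropic-regularized optimal transport via Sinkhorn iterations.
    Returns a transport plan aligning image views (rows, marginal a)
    with class descriptions (columns, marginal b)."""
    K = torch.exp(-cost / eps)                              # [n_views, n_descs]
    u = torch.ones_like(a)
    for _ in range(iters):
        u = a / (K @ (b / (K.t() @ u)))                     # alternate scaling
    v = b / (K.t() @ u)
    return u.unsqueeze(1) * K * v.unsqueeze(0)              # diag(u) K diag(v)


def awt_class_score(img_feats: torch.Tensor, txt_feats: torch.Tensor,
                    img_logits: torch.Tensor, txt_logits: torch.Tensor) -> torch.Tensor:
    """Score one class: transport entropy-weighted image views onto
    entropy-weighted class descriptions; higher transported similarity
    means a better match. img_feats: [n_views, d], txt_feats: [n_descs, d],
    both L2-normalized."""
    a = entropy_weights(img_logits)                         # view weights
    b = entropy_weights(txt_logits)                         # description weights
    sim = img_feats @ txt_feats.t()                         # cosine similarity
    plan = sinkhorn(1.0 - sim, a, b)                        # cost = 1 - similarity
    return (plan * sim).sum()                               # similarity under the plan
```

Entropic regularization keeps the transport plan smooth and the iterations GPU-friendly, which is the usual reason Sinkhorn-style solvers are preferred over exact linear-programming OT in settings like this.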