Pre-trained vision-language models (VLMs) have shown impressive results across a wide range of visual classification tasks. However, their potential is often underexploited when they are adapted to understand new concepts, owing to the limited information available about new classes. To address this limitation, we introduce a novel adaptation framework, AWT (Augment, Weight, then Transport). AWT comprises three key components: augmenting inputs with diverse visual perspectives and enriched class descriptions via image transformations and language models; dynamically weighting each input according to its prediction entropy; and employing optimal transport to mine semantic correlations in the vision-language space. AWT can be seamlessly integrated into various VLMs, enhancing their zero-shot capabilities without additional training and enabling few-shot learning through an integrated multimodal adapter module. We evaluate AWT in multiple challenging scenarios, including zero-shot and few-shot image classification, zero-shot video action recognition, and out-of-distribution generalization, and it consistently outperforms state-of-the-art methods in every setting. Extensive further studies demonstrate AWT's effectiveness and adaptability across different VLMs, architectures, and scales.
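To make the three-step pipeline concrete, the sketch below illustrates one plausible reading of it in PyTorch. It assumes pre-computed, L2-normalized CLIP-style features for a set of augmented image views and a set of class descriptions, plus per-input class-prediction logits from which entropy weights are derived. The function names (entropy_weights, sinkhorn, awt_class_score), the temperature tau, and the regularization strength eps are hypothetical choices for exposition, not the paper's exact formulation; standard Sinkhorn iterations stand in for whichever entropic optimal-transport solver the method actually uses.

```python
# Minimal, illustrative sketch of an AWT-style pipeline
# (Augment, Weight, then Transport). Names and hyperparameters
# are assumptions for exposition, not the paper's exact method.
import torch
import torch.nn.functional as F


def entropy_weights(logits: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Weight each input by the negated entropy of its class prediction:
    confident inputs (low entropy) receive higher weight."""
    probs = logits.softmax(dim=-1)                          # [n, num_classes]
    ent = -(probs * probs.clamp_min(1e-12).log()).sum(-1)   # [n]
    return F.softmax(-ent / tau, dim=0)                     # weights sum to 1


def sinkhorn(cost: torch.Tensor, a: torch.Tensor, b: torch.Tensor,
             eps: float = 0.1, iters: int = 100) -> torch.Tensor:
    """Entropic-regularized optimal transport via Sinkhorn iterations.
    Returns a transport plan aligning image views (rows, marginal a)
    with class descriptions (columns, marginal b)."""
    K = torch.exp(-cost / eps)                              # [n_views, n_descs]
    u = torch.ones_like(a)
    for _ in range(iters):
        u = a / (K @ (b / (K.t() @ u)))                     # alternate scaling
    v = b / (K.t() @ u)
    return u.unsqueeze(1) * K * v.unsqueeze(0)              # diag(u) K diag(v)


def awt_class_score(img_feats: torch.Tensor, txt_feats: torch.Tensor,
                    img_logits: torch.Tensor, txt_logits: torch.Tensor) -> torch.Tensor:
    """Score one class: transport entropy-weighted image views onto
    entropy-weighted class descriptions; higher transported similarity
    means a better match. img_feats: [n_views, d], txt_feats: [n_descs, d],
    both L2-normalized."""
    a = entropy_weights(img_logits)                         # view weights
    b = entropy_weights(txt_logits)                         # description weights
    sim = img_feats @ txt_feats.t()                         # cosine similarity
    plan = sinkhorn(1.0 - sim, a, b)                        # cost = 1 - similarity
    return (plan * sim).sum()                               # similarity under the plan
```

Entropic regularization keeps the transport plan smooth and the iterations GPU-friendly, which is the usual reason Sinkhorn-style solvers are preferred over exact linear-programming OT in settings like this.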