The choice of input text prompt plays a critical role in the performance of Vision-Language Pretrained (VLP) models such as CLIP. We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models. Our method is designed to substantially improve the generalization capabilities of VLP models when they are fine-tuned in a few-shot setting. We introduce trainable cross-attention-based adapter layers in conjunction with the vision and language encoders to strengthen the alignment between the two modalities. We enforce consistency between the respective encoder branches (which receive augmented inputs) to prevent overfitting in downstream tasks. We evaluate our method on three representative tasks: generalization to novel classes, cross-dataset evaluation, and unseen domain shifts. In practice, APoLLo achieves a relative gain of up to 6.03% over MaPLe (SOTA) on novel classes across 10 diverse image recognition datasets.
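To make the two mechanisms named in the abstract more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not give the adapter's exact form or the consistency objective, so the residual scaling factor `scale`, the single-head attention, the mean-absolute-difference loss, and all function names (`cross_attention_adapter`, `consistency_loss`, `random_matrix`) are illustrative assumptions.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Hypothetical helper: small random weights standing in for trained ones."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def cross_attention_adapter(queries, context, w_q, w_k, w_v, scale=0.1):
    """Assumed adapter form: tokens of one modality attend to the other.

    queries: n_q x d features from one encoder (e.g. text).
    context: n_c x d features from the other encoder (e.g. image).
    Returns n_q x d features: the input plus a scaled attended update.
    """
    q = matmul(queries, w_q)
    k = matmul(context, w_k)
    v = matmul(context, w_v)
    d = len(q[0])
    scores = matmul(q, transpose(k))  # n_q x n_c similarity scores
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    attended = matmul(weights, v)  # n_q x d
    # Residual connection keeps the pretrained features as the starting point.
    return [
        [x + scale * a for x, a in zip(q_row, a_row)]
        for q_row, a_row in zip(queries, attended)
    ]


def consistency_loss(feats_a, feats_b):
    """Assumed objective: mean absolute difference between two branches."""
    diffs = [
        abs(x - y)
        for row_a, row_b in zip(feats_a, feats_b)
        for x, y in zip(row_a, row_b)
    ]
    return sum(diffs) / len(diffs)


# Toy usage: 2 text tokens and 3 image tokens, feature width 4.
rng = random.Random(0)
dim = 4
text_feats = random_matrix(2, dim, rng)
image_feats = random_matrix(3, dim, rng)
w_q, w_k, w_v = (random_matrix(dim, dim, rng) for _ in range(3))

adapted_text = cross_attention_adapter(text_feats, image_feats, w_q, w_k, w_v)
# Stand-in for a second branch that received an augmented input.
augmented_text = [[x + 0.01 for x in row] for row in text_feats]
loss = consistency_loss(text_feats, augmented_text)
```

In this sketch the adapter only adds a small update on top of the encoder's features, which is one common way to keep a few-shot fine-tune close to the pretrained model. The consistency term is small when the branch outputs for the original and augmented inputs agree, which is the behavior the abstract describes as a guard against overfitting.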