Pre-trained vision-language models have inspired much research on few-shot learning. However, with only a few training images, two crucial problems arise: (1) the visual feature distributions are easily distracted by class-irrelevant information in images, and (2) aligning the visual and language feature distributions is difficult. To address the distraction problem, we propose a Selective Attack module, which consists of trainable adapters that generate spatial attention maps of images to guide attacks on class-irrelevant image areas. By perturbing these areas, the module captures the critical features and calibrates the visual distributions of image features. To better align the visual and language feature distributions that describe the same object class, we propose a cross-modal distribution alignment module, in which we introduce a vision-language prototype for each class to align the distributions and adopt the Earth Mover's Distance (EMD) to optimize the prototypes. For efficient computation, we derive an upper bound of the EMD. In addition, we propose an augmentation strategy that increases the diversity of the images and the text prompts, which reduces overfitting to the few-shot training images. Extensive experiments on 11 datasets demonstrate that our method consistently outperforms prior methods in few-shot learning. The implementation code will be available at https://github.com/bhrqw/SADA.
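To make the role of an EMD upper bound concrete, the following is a minimal sketch and not the bound derived in the paper, whose form the abstract does not state. It assumes two equal-size sets of feature vectors with uniform weights (the hypothetical arrays `visual` and `text`), computes the exact EMD by brute force over permutations, and compares it with the cost of the independent coupling, which is a feasible transport plan and therefore can never be cheaper than the optimum. All function names are illustrative.

```python
import itertools

import numpy as np


def pairwise_cost(x, y):
    """Euclidean distance between every row of x and every row of y."""
    return np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)


def emd_exact_uniform(x, y):
    """Exact EMD between two equal-size point sets with uniform weights.

    With uniform weights and equal sizes, an optimal transport plan is a
    one-to-one matching, so we search all permutations. This is only
    feasible for very small n and is used here purely as a reference.
    """
    cost = pairwise_cost(x, y)
    n = len(x)
    rows = np.arange(n)
    return min(
        cost[rows, list(perm)].mean()
        for perm in itertools.permutations(range(n))
    )


def emd_upper_bound(x, y):
    """Cost of the independent coupling (mass 1/n^2 on every pair).

    This plan satisfies both marginal constraints, so its cost is an
    upper bound on the EMD. It is an illustrative bound, not the paper's.
    """
    return pairwise_cost(x, y).mean()


# Hypothetical stand-ins for visual and language features of one class.
rng = np.random.default_rng(0)
visual = rng.normal(size=(5, 8))
text = rng.normal(size=(5, 8)) + 0.5

exact = emd_exact_uniform(visual, text)
bound = emd_upper_bound(visual, text)
print(f"exact EMD: {exact:.4f}, upper bound: {bound:.4f}")
```

The bound costs a single mean over the pairwise distance matrix, whereas the exact value requires solving an assignment problem; this gap in cost is the usual motivation for optimizing a bound instead of the EMD itself.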