Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP and ALIGN, have introduced a new paradigm for learning transferable visual representations. Recently, there has been a surge of interest in lightweight fine-tuning techniques that adapt these models to downstream visual tasks. We observe that current state-of-the-art fine-tuning methods, such as Tip-Adapter, rely only on the covariance between the query image feature and the features of the few-shot support samples. Covariance captures only linear relations and can therefore give a misleading impression of independence: two features with zero covariance may still be strongly dependent. To address this issue, we introduce Brownian Distance Covariance (BDC) to the field of vision-language reasoning. BDC captures non-linear as well as linear relations, providing a robust measure of feature dependence. Based on this, we present a novel method called BDC-Adapter, which integrates BDC prototype similarity reasoning and multi-modal reasoning network prediction to perform classification tasks. Our extensive experimental results show that the proposed BDC-Adapter handles non-linear relations and fully characterizes independence, outperforming the current state-of-the-art methods by large margins.
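To make the contrast between covariance and distance covariance concrete, the following is a minimal sketch of the classical sample distance covariance and distance correlation statistics (in the sense of Székely and Rizzo), written with NumPy. It is an illustration under stated assumptions, not the BDC-Adapter implementation: the function names (`pairwise_dist`, `double_center`, `distance_covariance_sq`, `distance_correlation`) and the toy data are hypothetical, and the way BDC-Adapter applies BDC to image features is not reproduced here.

```python
import numpy as np


def pairwise_dist(x):
    """Euclidean distance matrix between the rows of x (n samples)."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat a 1-D array as n scalar samples
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def double_center(d):
    """Subtract row and column means and add back the grand mean."""
    return (
        d
        - d.mean(axis=0, keepdims=True)
        - d.mean(axis=1, keepdims=True)
        + d.mean()
    )


def distance_covariance_sq(x, y):
    """Squared sample distance covariance of paired samples x and y."""
    a = double_center(pairwise_dist(x))
    b = double_center(pairwise_dist(y))
    return float((a * b).mean())


def distance_correlation(x, y):
    """Sample distance correlation in [0, 1]; 0.0 if either input is constant."""
    dvar = distance_covariance_sq(x, x) * distance_covariance_sq(y, y)
    if dvar <= 0.0:
        return 0.0
    dcov2 = max(distance_covariance_sq(x, y), 0.0)  # guard against round-off
    return float(np.sqrt(dcov2 / np.sqrt(dvar)))


# Toy data (hypothetical): y is a deterministic but non-linear function of x.
x = np.linspace(-1.0, 1.0, 201)
y = x ** 2

pearson = float(np.corrcoef(x, y)[0, 1])  # linear correlation
dcor = distance_correlation(x, y)         # distance correlation

print(f"Pearson correlation:  {pearson:.3e}")
print(f"Distance correlation: {dcor:.3f}")
```

In this toy case `y` is fully determined by `x`, yet the Pearson correlation is numerically zero because the relation is symmetric and non-linear. The distance correlation stays clearly above zero, which is the property the abstract relies on when it argues that covariance alone can suggest independence where none exists.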