In this paper, we consider the problem of prototype-based vision-language reasoning. We observe that existing methods face three major challenges: 1) escalating resource demands and prolonged training times, 2) excessive numbers of learnable parameters, and 3) fine-tuning based on only a single modality. These challenges hinder their ability to adapt Vision-Language Models (VLMs) to downstream tasks. Motivated by this observation, we propose NODE-Adapter, a novel method that utilizes Neural Ordinary Differential Equations (Neural ODEs) for better vision-language reasoning. To fully leverage both the visual and textual modalities and to estimate class prototypes more effectively and accurately, our method proceeds in two stages: cross-modal prototype construction and cross-modal prototype optimization with Neural ODEs. Specifically, we use the VLM to encode hand-crafted prompts into textual features and few-shot support images into visual features. We then estimate the textual and visual prototypes by averaging the respective features, and adaptively combine the two to construct a cross-modal prototype. To alleviate prototype bias, we model the prototype optimization process as an initial value problem with Neural ODEs, estimating a continuous gradient flow. Extensive experiments covering few-shot classification, domain generalization, and visual reasoning on human-object interaction demonstrate that the proposed method significantly outperforms existing state-of-the-art approaches.
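The two stages described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the random features stand in for frozen VLM encoder outputs, the blending coefficient `lam` (learnable in the method) is fixed, and the Neural ODE dynamics `f_theta` and its fixed-step Euler integration are untrained placeholders for the actual learned vector field and adaptive ODE solver.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Placeholder features: in the method these come from the frozen VLM
# text/image encoders; random vectors here just illustrate the shapes.
D = 8                      # embedding dim (CLIP-style encoders use 512/768)
n_prompts, n_shots = 4, 5  # hand-crafted prompts / support images per class

text_feats = l2_normalize(rng.normal(size=(n_prompts, D)))
img_feats  = l2_normalize(rng.normal(size=(n_shots, D)))

# Stage 1: cross-modal prototype construction — average each modality,
# then adaptively blend (lam is learnable in the paper; fixed here).
t_proto = l2_normalize(text_feats.mean(axis=0))
v_proto = l2_normalize(img_feats.mean(axis=0))
lam = 0.5
proto = l2_normalize(lam * t_proto + (1.0 - lam) * v_proto)

# Stage 2: treat prototype refinement as an initial value problem
#   dP/dt = f_theta(P),  P(0) = cross-modal prototype,
# integrated here with fixed-step Euler as a stand-in for a Neural ODE
# solver; f_theta is an untrained linear map for illustration only.
W = rng.normal(scale=0.1, size=(D, D))

def f_theta(p):
    return np.tanh(p @ W)

def euler_integrate(p0, steps=10, dt=0.1):
    p = p0
    for _ in range(steps):
        p = p + dt * f_theta(p)   # one Euler step along the gradient flow
    return l2_normalize(p)

refined = euler_integrate(proto)
print(refined.shape)  # (8,)
```

In practice the vector field `f_theta` is parameterized by a small network and trained end-to-end, and a black-box adaptive solver replaces the Euler loop; the structure of the computation is the same.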