With growing attention to large vision-language models such as CLIP, considerable effort has been devoted to building efficient prompts. Unlike conventional methods that learn only a single prompt, we propose to learn multiple comprehensive prompts that describe diverse characteristics of a category, such as intrinsic attributes or extrinsic contexts. However, directly matching every prompt to the same visual feature is problematic, because it pushes the prompts to converge to a single point. To solve this problem, we propose to apply optimal transport to match the vision and text modalities. Specifically, we first model each image and each category as a set of visual features and a set of textual features, respectively. Then, we apply a two-stage optimization strategy to learn the prompts. In the inner loop, we use the Sinkhorn algorithm to compute the optimal transport distance that aligns visual features with prompts, while in the outer loop, we learn the prompts from supervised data using this distance. Extensive experiments on the few-shot recognition task show improvements that demonstrate the superiority of our method. The code is available at https://github.com/CHENGY12/PLOT.
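To make the inner loop concrete, the following is a minimal NumPy sketch of an entropy-regularized optimal transport distance between a set of visual features and a set of prompt features, computed with Sinkhorn iterations. It is an illustration under stated assumptions rather than the released implementation: the cosine-based cost, the uniform marginals, the regularization weight `eps`, the iteration count, and the function names `sinkhorn_plan` and `ot_distance` are all hypothetical choices made here. The outer loop, which would update the prompts by back-propagating through this distance, is not shown.

```python
import numpy as np


def sinkhorn_plan(cost, eps=0.1, n_iters=100):
    """Approximate an entropy-regularized transport plan for a cost matrix.

    Assumption: both marginals are uniform, i.e. every visual feature and
    every prompt carries equal mass.
    """
    m, n = cost.shape
    mu = np.full(m, 1.0 / m)   # mass on each visual feature
    nu = np.full(n, 1.0 / n)   # mass on each prompt
    K = np.exp(-cost / eps)    # Gibbs kernel
    u = np.ones(m)
    v = np.ones(n)
    for _ in range(n_iters):
        u = mu / (K @ v)       # rescale rows toward mu
        v = nu / (K.T @ u)     # rescale columns toward nu
    # Transport plan: diag(u) @ K @ diag(v)
    return u[:, None] * K * v[None, :]


def ot_distance(visual, prompts, eps=0.1, n_iters=100):
    """Transport cost between a visual feature set and a prompt feature set.

    Assumption: the cost is one minus cosine similarity.
    """
    visual = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    prompts = prompts / np.linalg.norm(prompts, axis=1, keepdims=True)
    cost = 1.0 - visual @ prompts.T
    plan = sinkhorn_plan(cost, eps, n_iters)
    return float(np.sum(plan * cost)), plan


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    local_features = rng.normal(size=(49, 8))   # e.g. a 7x7 grid of local features
    prompt_features = rng.normal(size=(4, 8))   # e.g. 4 prompts for one category
    dist, plan = ot_distance(local_features, prompt_features)
    print(dist, plan.shape)
```

Because the plan spreads mass across all prompts instead of matching every prompt to one global feature, different prompts can attend to different local features, which is the behavior the abstract motivates.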