Prompt learning methods are attracting growing attention for their ability to adapt large vision-language models to new domains using pre-trained contextual knowledge and minimal training data. However, existing works typically optimize a unified prompt input and often struggle with fine-grained classification tasks because such prompts lack sufficiently discriminative attributes. To address this, we propose a framework built on a dual context of domain-shared and class-specific contexts, where the latter is generated by Large Language Models (LLMs) such as GPTs. This dual-prompt design enriches the model's feature representation by combining implicitly learned factors with the explicit knowledge encoded in LLMs. Moreover, we formulate the alignment between the constructed prompts and visual tokens as an Unbalanced Optimal Transport (UOT) problem. Through partial matching, UOT properly aligns discrete sets of visual tokens and prompt embeddings under different mass distributions, which is particularly valuable for handling irrelevant or noisy elements: relaxing exact mass preservation ensures that it does not overly constrain the transport solution. Furthermore, UOT's characteristics integrate seamlessly with image augmentation, expanding the training sample pool while keeping the distance between perturbed images and prompt inputs within a reasonable range. Extensive experiments across few-shot classification and adapter settings substantiate the superiority of our model over current state-of-the-art baselines.
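As a rough illustration of the partial-matching idea, the sketch below implements a generic entropic UOT solver with KL-relaxed marginals (the standard Sinkhorn-style scaling iterations) and applies it to a toy alignment between "visual token" and "prompt embedding" vectors. This is not the paper's exact formulation; the function name, toy data, and hyperparameters (`eps`, `rho`) are illustrative assumptions.

```python
import numpy as np

def unbalanced_sinkhorn(cost, a, b, eps=0.05, rho=1.0, n_iter=200):
    """Entropic unbalanced OT with KL-relaxed marginals (illustrative sketch).

    cost : (n, m) cost matrix; a, b : source/target mass vectors.
    Because marginal constraints are only penalized (weight rho), not enforced,
    noisy or irrelevant elements can shed mass instead of forcing a bad match.
    """
    K = np.exp(-cost / eps)          # Gibbs kernel
    fi = rho / (rho + eps)           # damping exponent from the KL penalty
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        u = (a / (K @ v)) ** fi
        v = (b / (K.T @ u)) ** fi
    return u[:, None] * K * v[None, :]   # transport plan

# Toy example: align 4 hypothetical visual tokens with 3 prompt embeddings
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
prompts = rng.normal(size=(3, 8))
tn = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
pn = prompts / np.linalg.norm(prompts, axis=1, keepdims=True)
C = 1.0 - tn @ pn.T                  # cosine-distance cost
P = unbalanced_sinkhorn(C, np.full(4, 0.25), np.full(3, 1.0 / 3.0))
# Row sums of P need not equal the input masses exactly (partial matching)
print(P.shape, P.sum(axis=1))
```

The key design point is that, unlike balanced OT, the row and column sums of the returned plan only approximate `a` and `b`, so uninformative tokens contribute less transported mass rather than being forcibly matched.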