Going beyond mere fine-tuning of vision-language models (VLMs), learnable prompt tuning has emerged as a promising, resource-efficient alternative. Despite its potential, learning prompts effectively faces two challenges: (i) training in a low-shot scenario leads to overfitting, which limits adaptability and weakens performance on new classes or datasets; (ii) the efficacy of prompt tuning depends heavily on the label space, with performance decreasing in large class spaces, which signals potential gaps in bridging image and class concepts. In this work, we ask whether better text semantics can help address these concerns. In particular, we introduce a prompt-tuning method that leverages class descriptions obtained from large language models (LLMs). Our approach constructs part-level, description-guided views of both image and text features, which are subsequently aligned to learn more generalizable prompts. In comprehensive experiments across 11 benchmark datasets, our method outperforms established methods, demonstrating substantial improvements.
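To make the idea of a description-guided view concrete, the following is a minimal sketch of one plausible reading, not the method's actual implementation. It assumes that an image is represented by a list of local (patch) feature vectors and that each class comes with a list of description embeddings produced by a text encoder. The function names (`description_guided_view`, `class_score`), the softmax pooling, and the temperature value are all hypothetical choices made for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def description_guided_view(patch_feats, desc_feat, temperature=0.1):
    """Pool patch features, weighting each patch by its similarity to one description.

    Assumption: softmax attention over cosine similarities; the paper may
    construct its views differently.
    """
    patches = [normalize(p) for p in patch_feats]
    desc = normalize(desc_feat)
    weights = softmax([dot(p, desc) for p in patches], temperature)
    pooled = [sum(w * p[i] for w, p in zip(weights, patches))
              for i in range(len(desc))]
    return normalize(pooled)


def class_score(patch_feats, desc_feats):
    """Mean cosine similarity between each description and its guided image view."""
    sims = [dot(description_guided_view(patch_feats, d), normalize(d))
            for d in desc_feats]
    return sum(sims) / len(sims)


# Toy features (hypothetical): three orthogonal "part" patches.
patches = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
matching_descs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]       # describe parts that are present
mismatched_descs = [[-1.0, 0.0, 0.0], [0.0, 0.0, -1.0]]   # describe parts that are absent

print(class_score(patches, matching_descs))
print(class_score(patches, mismatched_descs))
```

In this toy setting, a class whose descriptions match visible parts scores higher than one whose descriptions do not. In a prompt-tuning setup, a score of this kind could feed a cross-entropy loss over classes so that the learnable prompts are updated to improve the alignment; that training loop is omitted here.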