Going beyond mere fine-tuning of vision-language models (VLMs), learnable prompt tuning has emerged as a promising, resource-efficient alternative. Despite its potential, learning prompts effectively faces the following challenges: (i) training in a low-shot scenario results in overfitting, which limits adaptability and yields weaker performance on new classes or datasets; (ii) the efficacy of prompt tuning relies heavily on the label space, with performance decreasing in large class spaces, which signals potential gaps in bridging image and class concepts. In this work, we investigate whether better text semantics can help address these concerns. In particular, we introduce a prompt-tuning method that leverages class descriptions obtained from Large Language Models (LLMs). These class descriptions are used to bridge the image and text modalities. Our approach constructs part-level, description-guided image and text features, which are subsequently aligned to learn more generalizable prompts. Comprehensive experiments across 11 benchmark datasets show that our method substantially outperforms established methods.
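To make the idea of bridging the image and text modalities through class descriptions concrete, the following is a minimal sketch and not the method proposed in this work. It assumes that an image and each LLM-generated class description have already been embedded into a shared space by some VLM encoder, and it scores a class by the mean cosine similarity between the image feature and that class's description features. The function names (`cosine`, `class_score`, `predict`) and the toy three-dimensional vectors are hypothetical and stand in for real encoder outputs; prompt learning and part-level feature construction are omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def class_score(image_feat, description_feats):
    """Mean cosine similarity between an image feature and a class's description features."""
    sims = [cosine(image_feat, d) for d in description_feats]
    return sum(sims) / len(sims)


def predict(image_feat, class_to_description_feats):
    """Return the best-scoring class name and the score of every class."""
    scores = {
        name: class_score(image_feat, feats)
        for name, feats in class_to_description_feats.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy embeddings: two descriptions per class, one image.
toy_descriptions = {
    "sparrow": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "sedan": [[0.0, 0.1, 0.9], [0.1, 0.0, 0.8]],
}
toy_image = [0.85, 0.15, 0.05]

label, scores = predict(toy_image, toy_descriptions)
print(label, scores)
```

Averaging over several descriptions per class, rather than relying on a single class-name embedding, is one simple way in which richer text semantics can inform the image-class match; a learned method would replace the fixed toy vectors with encoder outputs conditioned on trainable prompts.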