Despite the efficiency of prompt learning in transferring vision-language models (VLMs) to downstream tasks, existing methods mainly learn prompts in a coarse-grained manner, where the learned prompt vectors are shared across all categories. Consequently, these tailored prompts often fail to discern class-specific visual concepts, hindering transfer performance on classes that share similar or complex visual attributes. Recent advances mitigate this challenge by leveraging external knowledge from Large Language Models (LLMs) to furnish class descriptions, yet at a notable inference cost. In this paper, we introduce TextRefiner, a plug-and-play method that refines the text prompts of existing methods by leveraging the internal knowledge of VLMs. Specifically, TextRefiner builds a novel local cache module to encapsulate fine-grained visual concepts derived from local tokens within the image branch. By aggregating and aligning the cached visual descriptions with the original output of the text branch, TextRefiner efficiently refines and enriches the learned prompts of existing methods without relying on any external expertise. For example, it improves the performance of CoOp from 71.66% to 76.94% on 11 benchmarks, surpassing CoCoOp, which introduces instance-wise features for text prompts. Equipped with TextRefiner, PromptKD achieves state-of-the-art performance while remaining efficient at inference. Our code is released at https://github.com/xjjxmu/TextRefiner
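To make the high-level idea concrete, the sketch below illustrates one plausible reading of the described mechanism: local (patch) tokens from the image branch are pooled into a small cache of fine-grained visual concepts, and each class text feature attends over that cache and absorbs a residual update. This is a minimal illustrative sketch, not the authors' actual implementation; the function name, slot pooling, and the mixing weight `alpha` are assumptions.

```python
import numpy as np

def refine_text_features(text_feats, local_tokens, num_slots=8, alpha=0.1):
    """Hypothetical sketch of TextRefiner-style refinement (not the exact method).

    text_feats:   (C, D) class features from the text branch
    local_tokens: (N, D) local patch tokens from the image branch
    """
    n, d = local_tokens.shape
    # Build a small "local cache": pool contiguous groups of local tokens
    # into num_slots fine-grained concept vectors (pooling rule is assumed).
    usable = (n // num_slots) * num_slots
    cache = local_tokens[:usable].reshape(num_slots, -1, d).mean(axis=1)  # (S, D)
    # Each class text feature attends over the cached visual concepts.
    logits = text_feats @ cache.T / np.sqrt(d)                 # (C, S)
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    # Residual fusion of aggregated cache content into the text features,
    # then renormalization for cosine-similarity classification.
    refined = text_feats + alpha * (attn @ cache)
    return refined / np.linalg.norm(refined, axis=1, keepdims=True)

# Example usage with random features standing in for CLIP outputs.
rng = np.random.default_rng(0)
text_feats = rng.normal(size=(5, 64))     # 5 classes
local_tokens = rng.normal(size=(32, 64))  # 32 patch tokens
out = refine_text_features(text_feats, local_tokens)
print(out.shape)  # (5, 64)
```

Because the refinement is a single attention-and-residual step over features the VLM already computes, it adds negligible inference cost compared with querying an external LLM for class descriptions.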