Compound Text-Guided Prompt Tuning via Image-Adaptive Cues

Vision-Language Models (VLMs) such as CLIP have demonstrated remarkable generalization to downstream tasks. However, existing prompt-tuning frameworks must process learnable textual inputs for all categories in parallel, which consumes massive GPU memory when the target dataset contains a large number of categories. Moreover, previous works require category names to be included within the prompts, and therefore perform poorly when category names are ambiguous. To address these shortcomings, we propose Compound Text-Guided Prompt Tuning (TGP-T), which significantly reduces resource demand while achieving superior performance. We introduce text supervision into the optimization of prompts, which brings two benefits: 1) it removes the model's reliance on pre-defined category names during inference, thereby enabling more flexible prompt generation; 2) it reduces the number of inputs to the text encoder, which significantly decreases GPU memory consumption. Specifically, we find that compound text supervision, i.e., category-wise and content-wise, is highly effective, since the two forms provide inter-class separability and capture intra-class variations, respectively. Moreover, we condition the prompt generation on visual features through a module called Bonder, which facilitates the alignment between prompts and visual features. Extensive experiments on few-shot recognition and domain generalization demonstrate that TGP-T achieves superior performance with consistently lower training costs. It reduces GPU memory usage by 93% and attains a 2.5% performance gain on 16-shot ImageNet. The code is available at https://github.com/EricTan7/TGP-T.
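To make the memory argument concrete, the following minimal sketch counts how many sequences the text encoder must process per training step under two prompting schemes. It is an illustration only, not the released implementation: the function names, the batch size of 8, and the choice of two prompts per image (one per type of text supervision) are assumptions made for this example. The 1,000-class count of ImageNet and CLIP's 77-token text context length are the only fixed quantities.

```python
# Illustrative count of text-encoder inputs per training step.
# All function names and the example settings below are hypothetical;
# they are not taken from the TGP-T code base.

CONTEXT_LENGTH = 77  # CLIP's text context length, in tokens


def per_class_prompt_inputs(num_classes: int) -> int:
    """Sequences per step when every category has its own learnable prompt."""
    return num_classes


def fixed_prompt_inputs(prompts_per_image: int, batch_size: int) -> int:
    """Sequences per step when each image yields a fixed number of prompts.

    Assumption: the count depends on the batch, not on the number of classes.
    """
    return prompts_per_image * batch_size


def token_positions(num_sequences: int, context_length: int = CONTEXT_LENGTH) -> int:
    """Total token positions the text encoder has to process."""
    return num_sequences * context_length


if __name__ == "__main__":
    per_class = per_class_prompt_inputs(num_classes=1000)  # ImageNet
    fixed = fixed_prompt_inputs(prompts_per_image=2, batch_size=8)  # assumed
    print("per-class prompts:", per_class, "sequences,", token_positions(per_class), "tokens")
    print("fixed prompts:    ", fixed, "sequences,", token_positions(fixed), "tokens")
```

Under these assumed settings the per-class scheme grows linearly with the number of categories, while the fixed-prompt scheme does not. The sketch counts encoder inputs only and is not meant to reproduce the 93% memory reduction reported above, which also depends on activations, gradients, and the rest of the model.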
