Textual-based prompt learning methods primarily employ multiple learnable soft prompts cascaded with hard class tokens as text prompt inputs, aiming to align the image and text (category) spaces for downstream tasks. However, current training aligns images only with predefined known categories and cannot associate them with unknown categories. In this work, we propose using universal attributes as a bridge to strengthen the alignment between images and unknown categories. Specifically, we introduce an Attribute-embedded Textual Prompt learning method for vision-language models, named ATPrompt. By incorporating multiple universal attribute tokens into the learnable soft prompts, this approach expands the learning space of soft prompts from the original one-dimensional category level to the multi-dimensional attribute level, transforming the text prompt from a category-centric form into an attribute-category hybrid form. To determine the attributes for a downstream task, we propose a differentiable attribute search method that learns to select representative and suitable attributes from a candidate pool summarized by a large language model. As an easy-to-use plug-in technique, ATPrompt can seamlessly replace the prompt format of existing textual-based methods, yielding general improvements at negligible computational cost. Extensive experiments on 11 datasets demonstrate the effectiveness of our method.
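To make the attribute-category hybrid prompt format concrete, the following is a minimal PyTorch sketch of how soft prompt segments can be interleaved with fixed attribute token embeddings before the hard class token. The class name ATPromptSketch, the segment sizes, and the exact interleaving order are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ATPromptSketch(nn.Module):
    """Builds an attribute-category hybrid prompt sequence:
    [soft]_1 [attr_1] [soft]_2 [attr_2] ... [soft]_{K+1} [CLASS]."""
    def __init__(self, embed_dim: int = 512, n_soft: int = 4, n_attrs: int = 2):
        super().__init__()
        # One learnable soft segment per attribute slot, plus one before the class token.
        self.soft_segments = nn.ParameterList(
            [nn.Parameter(0.02 * torch.randn(n_soft, embed_dim))
             for _ in range(n_attrs + 1)]
        )

    def forward(self, attr_embeds, class_embed):
        # attr_embeds: list of (n_tok_i, embed_dim) frozen embeddings of attribute
        #   words (e.g. "color", "shape"); class_embed: (n_tok_c, embed_dim)
        #   embedding of the hard class token(s).
        parts = []
        for soft, attr in zip(self.soft_segments[:-1], attr_embeds):
            parts += [soft, attr]                       # interleave soft prompt and attribute
        parts += [self.soft_segments[-1], class_embed]  # final soft segment + class token
        return torch.cat(parts, dim=0)                  # token sequence for the text encoder
```

In use, attr_embeds would hold the frozen token embeddings of the selected attribute words and class_embed the embedding of each category name, so only the soft segments receive gradients, mirroring the plug-in nature of the method.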
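The differentiable attribute search can likewise be sketched under the assumption of a DARTS-style softmax relaxation over the LLM-summarized candidate pool; the variable names and the exact objective below are assumptions for illustration, and the paper's bilevel optimization schedule may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeSearchSketch(nn.Module):
    """Continuous relaxation over a candidate attribute pool (DARTS-style assumption)."""
    def __init__(self, n_candidates: int):
        super().__init__()
        # One architecture logit per candidate attribute summarized by the LLM.
        self.alpha = nn.Parameter(torch.zeros(n_candidates))

    def forward(self, candidate_losses: torch.Tensor) -> torch.Tensor:
        # candidate_losses: (n_candidates,) alignment loss obtained when each
        # candidate attribute is embedded into the prompt for the same batch.
        weights = F.softmax(self.alpha, dim=0)
        return (weights * candidate_losses).sum()  # differentiable w.r.t. alpha

# After search converges, the highest-weighted candidate is kept, e.g.:
# chosen = candidate_names[search.alpha.argmax().item()]
```

The softmax mixture keeps the selection differentiable during search, and a hard argmax at the end yields the discrete attributes used in the final prompt format.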