Multi-label image recognition is a fundamental task in computer vision. Recently, vision-language models have made notable advances in this area. However, previous methods often fail to exploit the rich knowledge within language models and instead incorporate label semantics into visual features in a unidirectional manner. In this paper, we propose a Prompt-driven Visual-Linguistic Representation Learning (PVLR) framework to better leverage the capabilities of the linguistic modality. In PVLR, we first introduce a dual-prompting strategy comprising Knowledge-Aware Prompting (KAP) and Context-Aware Prompting (CAP). KAP uses fixed prompts to capture the intrinsic semantic knowledge and relationships across all labels, while CAP employs learnable prompts to capture context-aware label semantics and relationships. We then propose an Interaction and Fusion Module (IFM) that lets the representations obtained from KAP and CAP interact and fuses them. In contrast to the unidirectional fusion in previous works, we introduce Dual-Modal Attention (DMA), which enables bidirectional interaction between textual and visual features and yields context-aware label representations and semantically related visual representations. These representations are then used to compute similarities and generate the final predictions for all labels. Extensive experiments on three popular datasets, MS-COCO, Pascal VOC 2007, and NUS-WIDE, demonstrate the superiority of PVLR.
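To make the idea of bidirectional interaction concrete, the following is a minimal sketch of one way text and visual features could attend to each other before similarities are computed. It is not the authors' implementation: the abstract does not specify the internals of DMA, so the single-head attention without learned projections, the residual connection, the cosine similarity, the max pooling over patches, and all names (`cross_attend`, `dual_modal_scores`, `label_emb`, `patch_emb`) are assumptions made purely for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries, keys_values):
    # Single-head scaled dot-product cross-attention with a residual
    # connection. Learned projections are omitted (assumption).
    d = queries.shape[-1]
    weights = softmax(queries @ keys_values.T / np.sqrt(d), axis=-1)
    return queries + weights @ keys_values


def l2_normalize(x):
    # Scale each row to unit length so dot products become cosines.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def dual_modal_scores(label_emb, patch_emb):
    # Text attends to vision: each label gathers evidence from the patches,
    # giving context-aware label representations of shape (L, d).
    context_labels = cross_attend(label_emb, patch_emb)
    # Vision attends to text: each patch gathers label semantics,
    # giving semantically related visual representations of shape (P, d).
    semantic_patches = cross_attend(patch_emb, label_emb)
    # Cosine similarity between every label and every patch, shape (L, P).
    sim = l2_normalize(context_labels) @ l2_normalize(semantic_patches).T
    # Hypothetical aggregation: keep the best-matching patch per label.
    return sim.max(axis=1)


# Toy inputs: 5 labels and 49 patches in a shared 16-dimensional space.
rng = np.random.default_rng(0)
label_emb = rng.normal(size=(5, 16))
patch_emb = rng.normal(size=(49, 16))
scores = dual_modal_scores(label_emb, patch_emb)
```

Here `scores` holds one value per label. In a trained model, such scores would typically be scaled and passed through a sigmoid to obtain independent per-label probabilities, which is the usual choice for multi-label prediction.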