Multi-label image recognition is a fundamental task in computer vision. Recently, Vision-Language Models (VLMs) have made notable advances in this area. However, previous methods fail to effectively leverage the rich knowledge in language models and often incorporate label semantics into visual features only unidirectionally. To overcome these problems, we propose a Split-and-Synthesize Prompting with Gated Alignments (SSPA) framework to amplify the potential of VLMs. Specifically, we develop an in-context learning approach to draw on the inherent knowledge of large language models (LLMs). We then propose a novel Split-and-Synthesize Prompting (SSP) strategy that first models the generic knowledge and the downstream label semantics individually and then carefully aggregates them through a quaternion network. Moreover, we present Gated Dual-Modal Alignments (GDMA), which let the visual and linguistic modalities interact bidirectionally while eliminating redundant cross-modal information, enabling more efficient region-level alignments. Rather than making the final prediction in a sharp manner, as previous works do, we propose a soft aggregator that jointly considers the results from all image regions. With the help of flexible prompting and gated alignments, SSPA generalizes to specific domains. Extensive experiments on nine datasets from three domains (i.e., natural scenes, pedestrian attributes, and remote sensing) demonstrate that SSPA achieves state-of-the-art performance. Further analyses verify the effectiveness of SSP and the interpretability of GDMA. The code will be made public.
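To make the contrast between a sharp prediction and a soft aggregator concrete, the following is a minimal sketch of one common way to pool region-level scores for a single label: a softmax-weighted average, compared against taking the maximum. The abstract does not specify the form of the aggregator, so the softmax weighting, the `temperature` parameter, and the function names `soft_aggregate` and `hard_aggregate` are all illustrative assumptions, not the SSPA implementation.

```python
import math


def soft_aggregate(region_scores, temperature=1.0):
    """Softmax-weighted average of per-region scores for one label.

    Assumption: this weighting is illustrative only; the abstract does not
    state how the soft aggregator is defined.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(region_scores)
    weights = [math.exp((s - top) / temperature) for s in region_scores]
    total = sum(weights)
    # Every region contributes, with higher-scoring regions weighted more.
    return sum(w * s for w, s in zip(weights, region_scores)) / total


def hard_aggregate(region_scores):
    """Sharp alternative: only the single best-scoring region counts."""
    return max(region_scores)


# Hypothetical similarity scores of four image regions to one label.
scores = [0.1, 0.2, 0.9, 0.8]
soft = soft_aggregate(scores, temperature=0.5)
hard = hard_aggregate(scores)
```

In this sketch, a small `temperature` pushes the soft result toward the maximum, while a large one pushes it toward the plain mean, so the sharp prediction appears as a limiting case of the soft one.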