Recently, large-scale vision-language pre-trained (VLP) models have demonstrated impressive performance across various downstream tasks. Motivated by these advancements, pioneering efforts have emerged in multi-label image recognition with missing labels, leveraging VLP prompt-tuning technology. However, these methods usually struggle to match textual and visual features well, owing to complex semantic gaps and missing labels in multi-label images. To tackle this challenge, we propose \textbf{T}ext-\textbf{R}egion \textbf{M}atching for optimizing \textbf{M}ulti-\textbf{L}abel prompt tuning, namely TRM-ML, a novel method for enhancing meaningful cross-modal matching. In contrast to existing methods, we advocate exploring the information of category-aware regions rather than the entire image or individual pixels, which helps bridge the semantic gap between textual and visual representations in a one-to-one matching manner. Concurrently, we introduce multimodal contrastive learning to further narrow the semantic gap between the textual and visual modalities and to establish intra-class and inter-class relationships. Additionally, to handle missing labels, we propose a multimodal category prototype that leverages intra- and inter-category semantic relationships to estimate unknown labels, facilitating pseudo-label generation. Extensive experiments on the MS-COCO, PASCAL VOC, Visual Genome, NUS-WIDE, and CUB-200-2011 benchmark datasets demonstrate that our proposed framework outperforms state-of-the-art methods by a significant margin. Our code is available at \href{https://github.com/yu-gi-oh-leilei/TRM-ML}{\raisebox{-1pt}{\faGithub}}.