Current image-text retrieval methods have demonstrated impressive performance in recent years. However, they still face two problems: the inter-modal matching missing problem and the intra-modal semantic loss problem. These problems can significantly affect the accuracy of image-text retrieval. To address these challenges, we propose a novel method called Cross-modal and Uni-modal Soft-label Alignment (CUSA). Our method leverages uni-modal pre-trained models to provide soft-label supervision signals for the image-text retrieval model. Additionally, we introduce two alignment techniques, Cross-modal Soft-label Alignment (CSA) and Uni-modal Soft-label Alignment (USA), to overcome false negatives and enhance similarity recognition between uni-modal samples. Our method is designed to be plug-and-play, meaning it can be easily applied to existing image-text retrieval models without changing their original architectures. Through extensive experiments on various image-text retrieval models and datasets, we demonstrate that our method consistently improves the performance of image-text retrieval and achieves new state-of-the-art results. Furthermore, our method also boosts the uni-modal retrieval performance of image-text retrieval models, enabling them to achieve universal retrieval. The code and supplementary files can be found at https://github.com/lerogo/aaai24_itr_cusa.
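To make the idea of soft-label alignment concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify the loss, so several things here are assumptions: the use of a KL divergence between in-batch similarity distributions, the temperature `tau`, cosine similarity as the similarity measure, and all function and variable names. The same hypothetical helper is used for both settings: for a CSA-style loss the student similarities are image-to-text while the soft labels come from a uni-modal teacher, and for a USA-style loss both student and teacher similarities are computed within one modality.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def softmax(logits, axis=1):
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def soft_label_alignment(student_a, student_b, teacher_a, teacher_b, tau=0.1):
    """Mean row-wise KL(teacher || student) over in-batch similarity distributions.

    Assumption: the teacher's similarity distribution acts as the soft label
    and the student is pulled toward it with a KL term.
    """
    student = softmax(l2_normalize(student_a) @ l2_normalize(student_b).T / tau)
    teacher = softmax(l2_normalize(teacher_a) @ l2_normalize(teacher_b).T / tau)
    eps = 1e-12  # guards against log(0)
    kl_rows = np.sum(teacher * (np.log(teacher + eps) - np.log(student + eps)), axis=1)
    return float(np.mean(kl_rows))


# Toy batch: 4 image-text pairs with random stand-in embeddings.
rng = np.random.default_rng(0)
img, txt = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))          # student (retrieval model)
t_img, t_txt = rng.normal(size=(4, 16)), rng.normal(size=(4, 16))    # uni-modal teachers

# CSA-style term (assumed form): image-to-text scores follow text-teacher soft labels.
csa_loss = soft_label_alignment(img, txt, t_txt, t_txt)
# USA-style term (assumed form): image-to-image scores follow image-teacher soft labels.
usa_loss = soft_label_alignment(img, img, t_img, t_img)
```

Because the helper only consumes embeddings, a term of this kind could be added to an existing retrieval model's training loss without touching its architecture, which is consistent with the plug-and-play property described above. How the terms are weighted against the model's original loss is not stated in the abstract and is left open here.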