The Image-Text Matching (ITM) task, a fundamental vision-language (VL) task, suffers from inherent ambiguity arising from multiplicity and imperfect annotations. Deterministic functions are not sufficiently powerful to capture this ambiguity, which has prompted the exploration of probabilistic embeddings to tackle the challenge. However, the existing probabilistic ITM approach has two key shortcomings: the heavy computational burden of the Monte Carlo approximation, and loss saturation in the presence of abundant false negatives. To overcome these issues, this paper presents an improved Probabilistic Cross-Modal Embeddings method (named PCME++) that introduces a new probabilistic distance with a closed-form solution. In addition, two optimization techniques are proposed to enhance PCME++ further: first, the incorporation of pseudo-positives to prevent loss saturation under massive false negatives; second, mixed sample data augmentation for probabilistic matching. Experimental results on MS-COCO Caption and two extended benchmarks, CxC and ECCV Caption, demonstrate the effectiveness of PCME++ compared to state-of-the-art ITM methods. The robustness of PCME++ is also evaluated under noisy image-text correspondences. In addition, the potential applicability of PCME++ in automatic prompt tuning for zero-shot classification is shown. The code is available at https://naver-ai.github.io/pcmepp/.
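To illustrate why a closed-form probabilistic distance removes the Monte Carlo cost mentioned in the abstract, the following is a minimal sketch, not the PCME++ implementation. It assumes that each embedding is an independent Gaussian with diagonal covariance and uses the expected squared Euclidean distance, which for such Gaussians equals $\|\mu_1 - \mu_2\|_2^2 + \sum_i (\sigma_{1,i}^2 + \sigma_{2,i}^2)$. The abstract does not state the exact distance used by PCME++, so this choice, along with all function names and example values, is an assumption made for illustration.

```python
import random


def closed_form_expected_sq_distance(mu1, var1, mu2, var2):
    """E||Z1 - Z2||^2 for independent diagonal Gaussians (illustrative assumption).

    Equals ||mu1 - mu2||^2 + sum(var1 + var2); no sampling is needed.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    var_term = sum(v1 + v2 for v1, v2 in zip(var1, var2))
    return mean_term + var_term


def monte_carlo_expected_sq_distance(mu1, var1, mu2, var2, n_samples=20000, seed=0):
    """Estimate the same quantity by sampling, at a cost that grows with n_samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
            z1 = rng.gauss(m1, v1 ** 0.5)  # gauss takes a standard deviation
            z2 = rng.gauss(m2, v2 ** 0.5)
            total += (z1 - z2) ** 2
    return total / n_samples


# Hypothetical 3-dimensional "image" and "text" embeddings (means and variances).
mu_img, var_img = [0.5, -1.0, 2.0], [0.1, 0.2, 0.05]
mu_txt, var_txt = [0.0, -0.5, 1.5], [0.3, 0.1, 0.2]

exact = closed_form_expected_sq_distance(mu_img, var_img, mu_txt, var_txt)
approx = monte_carlo_expected_sq_distance(mu_img, var_img, mu_txt, var_txt)
print(f"closed form: {exact:.4f}, Monte Carlo estimate: {approx:.4f}")
```

The closed-form value is computed in a single pass over the embedding dimensions, whereas the sampled estimate needs many draws per pair and still carries sampling noise. This is the kind of computational saving a closed-form distance offers over a sampling-based one.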