Image-Text Matching (ITM), a fundamental vision-language (VL) task, suffers from inherent ambiguity arising from multiplicity and imperfect annotations. Deterministic functions are not powerful enough to capture this ambiguity, which has prompted the exploration of probabilistic embeddings. However, the existing probabilistic ITM approach has two key shortcomings: heavy computation due to Monte Carlo approximation, and loss saturation in the presence of abundant false negatives. To overcome these issues, this paper presents improved Probabilistic Cross-Modal Embeddings (PCME++), which introduce a new probabilistic distance with a closed-form solution. In addition, two optimization techniques are proposed to enhance PCME++ further: first, the incorporation of pseudo-positives to prevent loss saturation under massive false negatives; second, mixed sample data augmentation for probabilistic matching. Experimental results on MS-COCO Caption and two extended benchmarks, CxC and ECCV Caption, demonstrate the effectiveness of PCME++ compared to state-of-the-art ITM methods. The robustness of PCME++ is also evaluated under noisy image-text correspondences. Finally, the potential applicability of PCME++ to automatic prompt tuning for zero-shot classification is shown. The code is available at https://github.com/naver-ai/pcmepp.
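To make the contrast between a Monte Carlo approximation and a closed-form distance concrete, the following is a minimal sketch, not the PCME++ implementation. The abstract does not state the distance formula, so the sketch assumes a stand-in: the expected squared Euclidean distance between two independent Gaussians with diagonal covariance, which has the closed form $\lVert \mu_a - \mu_b \rVert_2^2 + \sum_i (\sigma_{a,i}^2 + \sigma_{b,i}^2)$. The function names, the embedding dimension, and the sample count are all hypothetical choices made for illustration.

```python
import numpy as np


def closed_form_distance(mu_a, var_a, mu_b, var_b):
    """Expected squared Euclidean distance between two independent
    diagonal Gaussians N(mu_a, diag(var_a)) and N(mu_b, diag(var_b)).

    Assumed stand-in distance for illustration only.
    """
    return float(np.sum((mu_a - mu_b) ** 2) + np.sum(var_a + var_b))


def monte_carlo_distance(mu_a, var_a, mu_b, var_b, n_samples, rng):
    """Estimate the same quantity by sampling from both Gaussians."""
    dim = mu_a.shape[0]
    z_a = mu_a + np.sqrt(var_a) * rng.standard_normal((n_samples, dim))
    z_b = mu_b + np.sqrt(var_b) * rng.standard_normal((n_samples, dim))
    return float(np.mean(np.sum((z_a - z_b) ** 2, axis=1)))


# Hypothetical image and text embeddings (mean and per-dimension variance).
rng = np.random.default_rng(0)
dim = 8
mu_img, mu_txt = rng.normal(size=dim), rng.normal(size=dim)
var_img = rng.uniform(0.1, 0.5, size=dim)
var_txt = rng.uniform(0.1, 0.5, size=dim)

exact = closed_form_distance(mu_img, var_img, mu_txt, var_txt)
estimate = monte_carlo_distance(
    mu_img, var_img, mu_txt, var_txt, n_samples=200_000, rng=rng
)
print(f"closed form: {exact:.4f}, Monte Carlo estimate: {estimate:.4f}")
```

Under this assumption, the closed form costs one pass over the embedding dimensions per image-text pair, while the sampled estimate needs many draws per pair to reach comparable accuracy, which is the kind of computational burden the abstract attributes to Monte Carlo approximation.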