Image-Text Matching (ITM), a fundamental vision-language (VL) task, suffers from inherent ambiguity arising from multiplicity and imperfect annotations. Deterministic functions are not sufficiently powerful to capture this ambiguity, which motivates the use of probabilistic embeddings. However, the existing probabilistic ITM approach has two key shortcomings: heavy computation due to Monte Carlo approximation, and loss saturation in the presence of abundant false negatives. To overcome these issues, this paper presents an improved Probabilistic Cross-Modal Embeddings (named PCME++) by introducing a new probabilistic distance with a closed-form solution. In addition, two optimization techniques are proposed to enhance PCME++ further: first, the incorporation of pseudo-positives to prevent loss saturation under massive false negatives; second, mixed sample data augmentation for probabilistic matching. Experimental results on MS-COCO Caption and two extended benchmarks, CxC and ECCV Caption, demonstrate the effectiveness of PCME++ compared to state-of-the-art ITM methods. The robustness of PCME++ is also evaluated under noisy image-text correspondences. In addition, the potential applicability of PCME++ to automatic prompt tuning for zero-shot classification is shown. The code is available at https://naver-ai.github.io/pcmepp/.
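To illustrate why a closed-form probabilistic distance removes the Monte Carlo cost, the following is a minimal sketch and not the PCME++ implementation. It assumes each embedding is a Gaussian with diagonal covariance and uses the expected squared Euclidean distance between independent samples, which has the known closed form ||mu_a - mu_b||^2 + sum(var_a + var_b). Whether this matches the exact distance defined in the paper is an assumption; the function names and toy values are hypothetical.

```python
import random


def closed_form_distance(mu_a, var_a, mu_b, var_b):
    """Expected squared Euclidean distance between independent samples of
    N(mu_a, diag(var_a)) and N(mu_b, diag(var_b)), computed analytically."""
    mean_term = sum((ma - mb) ** 2 for ma, mb in zip(mu_a, mu_b))
    var_term = sum(va + vb for va, vb in zip(var_a, var_b))
    return mean_term + var_term


def monte_carlo_distance(mu_a, var_a, mu_b, var_b, n_samples=20000, seed=0):
    """Sampling-based estimate of the same quantity, for comparison only."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        for ma, va, mb, vb in zip(mu_a, var_a, mu_b, var_b):
            za = rng.gauss(ma, va ** 0.5)  # gauss takes a standard deviation
            zb = rng.gauss(mb, vb ** 0.5)
            total += (za - zb) ** 2
    return total / n_samples


# Hypothetical 3-dimensional image and text embeddings (mean, variance).
mu_img, var_img = [0.2, -0.1, 0.5], [0.04, 0.09, 0.01]
mu_txt, var_txt = [0.0, 0.3, 0.4], [0.02, 0.05, 0.03]

exact = closed_form_distance(mu_img, var_img, mu_txt, var_txt)
approx = monte_carlo_distance(mu_img, var_img, mu_txt, var_txt)
print(f"closed form: {exact:.4f}, Monte Carlo estimate: {approx:.4f}")
```

The closed form needs one pass over the embedding dimensions, while the sampling estimate needs a pass per sample and only approaches the same value as the number of samples grows.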