Cross-modal retrieval methods build similarity relations between the vision and language modalities by jointly learning a common representation space. However, their predictions are often unreliable because of aleatoric uncertainty, which is induced by low-quality data, e.g., corrupt images, fast-paced videos, and non-detailed texts. In this paper, we propose a novel Prototype-based Aleatoric Uncertainty Quantification (PAU) framework that provides trustworthy predictions by quantifying the uncertainty arising from inherent data ambiguity. Concretely, we first construct a set of diverse learnable prototypes for each modality to represent the entire semantic subspace. Then Dempster-Shafer Theory and Subjective Logic Theory are used to build an evidential theoretical framework by associating evidence with the parameters of a Dirichlet distribution. The PAU model yields accurate uncertainty estimates and reliable predictions for cross-modal retrieval. Extensive experiments on four major benchmark datasets, MSR-VTT, MSVD, DiDeMo, and MS-COCO, demonstrate the effectiveness of our method. The code is available at https://github.com/leolee99/PAU.
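The link between evidence and Dirichlet parameters mentioned in the abstract can be illustrated with the standard Subjective Logic mapping: for K classes with non-negative evidence e_k, the Dirichlet parameters are alpha_k = e_k + 1, the Dirichlet strength is S = sum_k alpha_k, the belief masses are b_k = e_k / S, and the uncertainty mass is u = K / S. The sketch below is a minimal illustration of that mapping only, not the PAU implementation. The function name `subjective_opinion` and the example evidence values are hypothetical, and treating each prototype as one class is an assumption, since the abstract does not specify how evidence is derived from the prototypes.

```python
from typing import List, Tuple


def subjective_opinion(evidence: List[float]) -> Tuple[List[float], float]:
    """Map non-negative evidence over K classes to belief masses and uncertainty.

    Standard Subjective Logic mapping (hypothetical helper, not PAU's code):
        alpha_k = e_k + 1,  S = sum(alpha),  b_k = e_k / S,  u = K / S
    so that sum(b) + u == 1.
    """
    if not evidence:
        raise ValueError("evidence must contain at least one class")
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")

    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]          # Dirichlet parameters
    strength = sum(alpha)                        # Dirichlet strength S
    belief = [e / strength for e in evidence]    # belief mass per class
    uncertainty = num_classes / strength         # uncertainty mass u
    return belief, uncertainty


if __name__ == "__main__":
    # Assumed example: evidence over 4 prototypes (values are made up).
    ambiguous = [0.0, 0.0, 0.0, 0.0]   # no evidence for any prototype
    confident = [8.0, 0.0, 0.0, 0.0]   # strong evidence for one prototype

    for name, ev in (("ambiguous", ambiguous), ("confident", confident)):
        belief, u = subjective_opinion(ev)
        print(name, [round(b, 3) for b in belief], round(u, 3))
```

With no evidence at all, the uncertainty mass equals 1, and it shrinks as total evidence grows, which is the behavior one would want for ambiguous inputs such as corrupt images or non-detailed texts.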