Cross-modal retrieval methods build similarity relations between the vision and language modalities by jointly learning a common representation space. However, their predictions are often unreliable because of aleatoric uncertainty, which is induced by low-quality data such as corrupt images, fast-paced videos, and non-detailed texts. In this paper, we propose a novel Prototype-based Aleatoric Uncertainty Quantification (PAU) framework that provides trustworthy predictions by quantifying the uncertainty arising from inherent data ambiguity. Concretely, we first construct a set of diverse learnable prototypes for each modality to represent the entire semantic subspace. We then use Dempster-Shafer Theory and Subjective Logic Theory to build an evidential theoretical framework that associates evidence with the parameters of a Dirichlet distribution. The PAU model yields accurate uncertainty estimates and reliable predictions for cross-modal retrieval. Extensive experiments on four major benchmark datasets, MSR-VTT, MSVD, DiDeMo, and MS-COCO, demonstrate the effectiveness of our method. The code is accessible at https://github.com/leolee99/PAU.
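To make the evidence-to-Dirichlet step more concrete, the following is a minimal sketch rather than the PAU implementation. It assumes the standard Subjective Logic mapping, in which K non-negative evidence values e_k give Dirichlet parameters alpha_k = e_k + 1, belief masses b_k = e_k / S, and an uncertainty mass u = K / S with S the sum of all alpha_k. The function names, the toy 2-D prototypes, and the choice of exp(cosine similarity / temperature) as the evidence function are all illustrative assumptions and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def evidence_from_prototypes(feature, prototypes, temperature=0.1):
    """Hypothetical evidence function: exp(similarity / temperature).

    Any mapping to non-negative values would fit the framework; this
    particular choice is an assumption made for illustration.
    """
    return [math.exp(cosine(feature, p) / temperature) for p in prototypes]


def subjective_opinion(evidence):
    """Map non-negative evidence to belief masses and an uncertainty mass.

    Standard Subjective Logic formulation: alpha_k = e_k + 1,
    S = sum(alpha), b_k = e_k / S, u = K / S, so sum(b) + u = 1.
    """
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)
    belief = [e / strength for e in evidence]
    uncertainty = num_classes / strength
    return belief, uncertainty


# Toy prototypes standing in for learned ones (assumed values).
prototypes = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

# A feature aligned with one prototype versus one lying between two.
_, u_clear = subjective_opinion(
    evidence_from_prototypes([1.0, 0.0], prototypes))
_, u_ambiguous = subjective_opinion(
    evidence_from_prototypes([1.0, 1.0], prototypes))

print(f"uncertainty (clear feature):     {u_clear:.6f}")
print(f"uncertainty (ambiguous feature): {u_ambiguous:.6f}")
```

In this toy setting, the feature that matches a single prototype accumulates more total evidence than the one lying between two prototypes, so it receives the smaller uncertainty mass. With zero evidence for every prototype, the uncertainty mass is exactly 1.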