Multimodal semantic understanding often has to deal with uncertainty: a given message tends to refer to multiple targets. Such uncertainty, which arises both between modalities (inter-modal) and within a single modality (intra-modal), makes interpretation harder. Little work has studied how to model this uncertainty, particularly when pre-training on unlabeled datasets and fine-tuning on task-specific downstream datasets. In this paper, we project the representations of all modalities as probabilistic distributions via a Probability Distribution Encoder (PDE) that utilizes sequence-level interactions. Compared to existing deterministic methods, such uncertainty modeling can convey richer multimodal semantic information and more complex relationships. Furthermore, we integrate uncertainty modeling with popular pre-training frameworks and propose suitable pre-training tasks: Distribution-based Vision-Language Contrastive learning (D-VLC), Distribution-based Masked Language Modeling (D-MLM), and Distribution-based Image-Text Matching (D-ITM). The fine-tuned models are applied to challenging downstream tasks, including image-text retrieval, visual question answering, visual reasoning, and visual entailment, and achieve state-of-the-art results.
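To make the idea of distribution-valued representations concrete, the following is a minimal sketch and not the paper's implementation. The abstract does not state which distribution family or distance is used, so the sketch assumes diagonal Gaussians and the squared 2-Wasserstein distance, and plugs the negative distance into a standard image-to-text contrastive (InfoNCE) loss. The names `DiagGaussian`, `wasserstein2_sq`, and `contrastive_loss` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class DiagGaussian:
    """Hypothetical distribution-valued representation (an assumption):
    one mean and one standard deviation per embedding dimension."""
    mu: List[float]
    sigma: List[float]  # standard deviations, assumed non-negative


def wasserstein2_sq(p: DiagGaussian, q: DiagGaussian) -> float:
    """Squared 2-Wasserstein distance between two diagonal Gaussians:
    ||mu_p - mu_q||^2 + ||sigma_p - sigma_q||^2."""
    mean_term = sum((a - b) ** 2 for a, b in zip(p.mu, q.mu))
    std_term = sum((a - b) ** 2 for a, b in zip(p.sigma, q.sigma))
    return mean_term + std_term


def contrastive_loss(
    images: Sequence[DiagGaussian],
    texts: Sequence[DiagGaussian],
    temperature: float = 1.0,
) -> float:
    """Image-to-text InfoNCE loss with similarity = negative distance.
    images[i] and texts[i] are assumed to form a matched pair."""
    total = 0.0
    for i, img in enumerate(images):
        logits = [-wasserstein2_sq(img, txt) / temperature for txt in texts]
        # Numerically stable log-sum-exp over all candidate texts.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(images)
```

Under these assumptions, two representations with the same mean but different spread are still a nonzero distance apart. A point embedding cannot express that difference, which is one way to read the claim that distributions carry richer semantic information than deterministic features.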