Contrastive learning has become a particularly appealing approach across a wide range of multimodal tasks because it can learn representations from abundant unlabeled data using only pairing information (e.g., image-caption or video-audio pairs). Underpinning these approaches is the assumption of multi-view redundancy: that the information shared between modalities is necessary and sufficient for downstream tasks. However, in many real-world settings, task-relevant information is also contained in modality-unique regions, i.e., information that is present in only one modality but is still relevant to the task. How can we learn self-supervised multimodal representations that capture both the shared and the unique information relevant to downstream tasks? This paper proposes FactorCL, a new multimodal representation learning method that goes beyond multi-view redundancy. FactorCL is built from three new contributions: (1) factorizing task-relevant information into shared and unique representations, (2) capturing task-relevant information by maximizing lower bounds on mutual information (MI) and removing task-irrelevant information by minimizing MI upper bounds, and (3) multimodal data augmentations that approximate task relevance without labels. On large-scale real-world datasets, FactorCL captures both shared and unique information and achieves state-of-the-art results on six benchmarks.
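To make the idea of maximizing an MI lower bound and minimizing an MI upper bound concrete, the following is a minimal sketch of two generic sample-based estimators: the InfoNCE lower bound and a CLUB-style upper bound. The abstract does not say which bounds FactorCL uses, so both choices are assumptions made for illustration, and the function names `infonce_lower_bound` and `club_upper_bound` are hypothetical. The CLUB-style quantity is an upper bound only when the supplied log-conditionals approximate the true conditional distribution well.

```python
import math


def infonce_lower_bound(scores):
    """InfoNCE estimate of a lower bound on I(X; Y).

    scores[i][j] is a critic score f(x_i, y_j) for a batch of N pairs;
    the diagonal entries scores[i][i] belong to the true (paired) samples.
    Generic estimator, assumed here for illustration only.
    """
    n = len(scores)
    total = 0.0
    for i, row in enumerate(scores):
        # Numerically stable log-sum-exp over the row (all candidates y_j).
        m = max(row)
        log_norm = m + math.log(sum(math.exp(s - m) for s in row))
        # Log-probability of picking the true partner among the N candidates.
        total += row[i] - log_norm
    # The bound equals log N minus the InfoNCE loss, so it never exceeds log N.
    return total / n + math.log(n)


def club_upper_bound(log_cond):
    """CLUB-style sampled estimate of an upper bound on I(X; Y).

    log_cond[i][j] is log q(y_j | x_i) under some variational model q
    (hypothetical input; q must be fitted separately).
    """
    n = len(log_cond)
    # Average log-likelihood of true pairs.
    paired = sum(log_cond[i][i] for i in range(n)) / n
    # Average log-likelihood over all (x_i, y_j) combinations in the batch.
    unpaired = sum(sum(row) for row in log_cond) / (n * n)
    return paired - unpaired
```

In a training loop, one would typically maximize the first quantity over the encoders to retain information and minimize the second to discard it. With an uninformative critic (all scores equal) the InfoNCE estimate is zero, and it saturates at log N when the true pairs dominate every row.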