In many machine learning systems that jointly learn from multiple modalities, a core research question is to understand the nature of multimodal interactions: new task-relevant information that emerges when learning from both modalities and that is not present in either modality alone. We study this challenge of interaction quantification in a semi-supervised setting where only labeled unimodal data and naturally co-occurring but unlabeled multimodal data (e.g., unlabeled images and captions, video and corresponding audio) are available, because labeling the multimodal data is time-consuming. Using a precise information-theoretic definition of interactions, our key contributions are the derivations of lower and upper bounds on the amount of multimodal interactions in this semi-supervised setting. We propose two lower bounds, one based on the amount of shared information between modalities and one based on the disagreement between separately trained unimodal classifiers, and derive an upper bound through connections to approximate algorithms for min-entropy couplings. We validate these estimated bounds and show that they accurately track true interactions. Finally, we explore two semi-supervised multimodal applications based on these theoretical results: (1) analyzing the relationship between multimodal performance and estimated interactions, and (2) self-supervised learning that exploits disagreement between modalities rather than relying only on agreement, as is typically done.
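For readers unfamiliar with min-entropy couplings, the following is a minimal illustrative sketch, not the method used in this work. A min-entropy coupling is a joint distribution with two prescribed marginals whose joint entropy is as small as possible; finding it exactly is hard in general, so approximate algorithms are used. The abstract does not say which approximation is applied, so the code assumes a common greedy heuristic that repeatedly pairs the largest remaining masses. The function names `greedy_min_entropy_coupling` and `joint_entropy` and the parameter `tol` are hypothetical.

```python
import math


def greedy_min_entropy_coupling(p, q, tol=1e-12):
    """Greedy heuristic for a low-entropy coupling of marginals p and q.

    Assumption: this is a generic greedy approximation, chosen for
    illustration; it is not claimed to be the algorithm used in the paper.
    Returns a joint table whose row sums are p and column sums are q
    (up to floating-point error).
    """
    p = list(p)
    q = list(q)
    joint = [[0.0] * len(q) for _ in p]
    while True:
        # Pick the largest remaining mass in each marginal.
        i = max(range(len(p)), key=p.__getitem__)
        j = max(range(len(q)), key=q.__getitem__)
        mass = min(p[i], q[j])
        if mass <= tol:
            break
        # Assign as much mass as possible to cell (i, j); at least one of
        # p[i], q[j] becomes exactly zero, so the loop terminates.
        joint[i][j] += mass
        p[i] -= mass
        q[j] -= mass
    return joint


def joint_entropy(joint):
    """Shannon entropy (in bits) of a joint probability table."""
    return -sum(v * math.log2(v) for row in joint for v in row if v > 0)


if __name__ == "__main__":
    coupling = greedy_min_entropy_coupling([0.6, 0.4], [0.5, 0.5])
    print(coupling)
    print(joint_entropy(coupling))
```

The greedy choice concentrates probability mass in as few cells as possible, which is what keeps the joint entropy low; the result is a feasible coupling but not necessarily the optimal one.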