To fuse heterogeneous signals effectively, we need to understand how modalities interact: how each modality individually provides information useful for a task, and how this information changes in the presence of other modalities. In this paper, we present a comparative study of how humans annotate two categorizations of multimodal interactions: (1) partial labels, where different annotators annotate the label given the first modality, the second modality, and both modalities, and (2) counterfactual labels, where the same annotator first annotates the label given the first modality and is then asked to explicitly reason about how their answer changes when given the second. We further propose an alternative taxonomy based on (3) information decomposition, where annotators annotate the degrees of redundancy (the extent to which the modalities individually and together give the same predictions), uniqueness (the extent to which one modality enables a prediction that the other does not), and synergy (the extent to which both modalities together enable a prediction that neither would enable alone). Through experiments and annotations, we highlight several opportunities and limitations of each approach and propose a method to automatically convert annotations of partial and counterfactual labels to information decomposition, yielding an accurate and efficient method for quantifying multimodal interactions.
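To make the three quantities concrete, the following is a minimal sketch of how partial labels could be turned into redundancy, uniqueness, and synergy values. It is not the conversion method proposed in this paper. It assumes that the label given both modalities is treated as the target, and it swaps in the simple minimum-mutual-information definition of redundancy, which is only one of several decompositions in the literature. The function names `mutual_information` and `decompose` are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(pairs):
    """Plug-in estimate of I(A;B) in bits from a list of (a, b) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    count_a = Counter(a for a, _ in pairs)
    count_b = Counter(b for _, b in pairs)
    # Sum over observed cells of p(a,b) * log2( p(a,b) / (p(a) p(b)) ).
    return sum(
        (c / n) * log2(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )


def decompose(y1, y2, y):
    """Hypothetical helper, not the paper's method.

    y1, y2: labels annotated from the first / second modality alone.
    y:      labels annotated from both modalities (assumed to be the target).
    Redundancy is ASSUMED to be min(I(y1;y), I(y2;y)).
    """
    i1 = mutual_information(list(zip(y1, y)))
    i2 = mutual_information(list(zip(y2, y)))
    i12 = mutual_information(list(zip(zip(y1, y2), y)))
    redundancy = min(i1, i2)
    return {
        "redundancy": redundancy,
        "unique_1": i1 - redundancy,
        "unique_2": i2 - redundancy,
        # Whatever the pair explains beyond redundant and unique parts.
        "synergy": i12 - i1 - i2 + redundancy,
    }


# XOR-style toy data: neither modality alone predicts y, but the pair does.
xor_result = decompose([0, 0, 1, 1], [0, 1, 0, 1], [0, 1, 1, 0])

# Fully redundant toy data: each modality alone already gives the label.
same_result = decompose([0, 1, 0, 1], [0, 1, 0, 1], [0, 1, 0, 1])
```

On the XOR-style toy data all of the information lands in `synergy`, while on the fully redundant data it lands in `redundancy`, matching the informal definitions above. With real annotations the plug-in estimates would be noisy for small samples, so this is best read as an illustration of the definitions and not as an estimator to rely on.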