Supervised multi-modal learning involves mapping multiple modalities to a target label. Previous studies in this field have concentrated on capturing, in isolation, either inter-modality dependencies (the relationships between different modalities and the label) or intra-modality dependencies (the relationships within a single modality and the label). We argue that conventional approaches relying solely on one type of dependency may not be optimal in general. We view the multi-modal learning problem through the lens of generative models, treating the target as a source of multiple modalities and of the interactions between them. To this end, we propose the inter- & intra-modality modeling (I2M2) framework, which captures and integrates both the inter- and intra-modality dependencies, leading to more accurate predictions. We evaluate our approach on real-world healthcare and vision-and-language datasets with state-of-the-art models, demonstrating superior performance over methods that focus on only one type of modality dependency.
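To make the idea of integrating both dependency types concrete, here is a minimal sketch of one possible instantiation, not the paper's actual method: two per-modality classifiers (intra-modality dependencies) combined with a joint classifier over both modalities (inter-modality dependencies), with logits summed so the branches multiply in probability space. All module and variable names (`I2M2Sketch`, `intra1`, `inter`, the feature dimensions) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class I2M2Sketch(nn.Module):
    """Illustrative two-modality model: one classifier per modality
    (intra-modality branches) plus a joint classifier over the
    concatenated features (inter-modality branch). Summing the three
    logit vectors corresponds to multiplying the branches'
    unnormalized class likelihoods in log space."""

    def __init__(self, dim1, dim2, hidden, num_classes):
        super().__init__()
        # Intra-modality branches: each sees only its own modality.
        self.intra1 = nn.Sequential(nn.Linear(dim1, hidden), nn.ReLU(),
                                    nn.Linear(hidden, num_classes))
        self.intra2 = nn.Sequential(nn.Linear(dim2, hidden), nn.ReLU(),
                                    nn.Linear(hidden, num_classes))
        # Inter-modality branch: sees both modalities jointly, so it can
        # model cross-modal interactions the intra branches cannot.
        self.inter = nn.Sequential(nn.Linear(dim1 + dim2, hidden), nn.ReLU(),
                                   nn.Linear(hidden, num_classes))

    def forward(self, x1, x2):
        # Late fusion: integrate intra- and inter-modality predictions.
        return (self.intra1(x1) + self.intra2(x2)
                + self.inter(torch.cat([x1, x2], dim=-1)))

# Usage with hypothetical pre-extracted features, e.g. image and text embeddings.
model = I2M2Sketch(dim1=512, dim2=768, hidden=256, num_classes=10)
x1, x2 = torch.randn(4, 512), torch.randn(4, 768)
print(model(x1, x2).shape)  # torch.Size([4, 10])
```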