Multimodal learning is defined as learning over multiple heterogeneous input modalities such as video, audio, and text. In this work, we study how models behave when the types of modalities differ between training and deployment, a situation that arises naturally in many applications of multimodal learning to hardware platforms. We present a multimodal robustness framework that provides a systematic analysis of common multimodal representation learning methods. Further, we identify robustness shortcomings of these approaches and propose two intervention techniques that lead to $1.5\times$-$4\times$ robustness improvements on three datasets: AudioSet, Kinetics-400, and ImageNet-Captions. Finally, we demonstrate that these interventions better utilize additional modalities, if present, achieving competitive results of $44.2$ mAP on AudioSet 20K.