Multimodal learning is defined as learning over multiple heterogeneous input modalities such as video, audio, and text. In this work, we are concerned with understanding how models behave when the types of modalities differ between training and deployment, a situation that naturally arises in many applications of multimodal learning to hardware platforms. We present a multimodal robustness framework to provide a systematic analysis of common multimodal representation learning methods. Further, we identify robustness shortcomings of these approaches and propose two intervention techniques that lead to $1.5\times$-$4\times$ robustness improvements on three datasets: AudioSet, Kinetics-400, and ImageNet-Captions. Finally, we demonstrate that these interventions better utilize additional modalities, if present, to achieve competitive results of $44.2$ mAP on AudioSet 20K.