Deep multimodal learning has made significant progress in recent years. However, current fusion approaches are static in nature: they process and fuse every multimodal input with identical computation, without accounting for the diverse computational demands of different multimodal data. In this work, we propose dynamic multimodal fusion (DynMM), a new approach that adaptively fuses multimodal data and generates data-dependent forward paths during inference. To this end, we propose a gating function that provides modality-level or fusion-level decisions on the fly based on multimodal features, and a resource-aware loss function that encourages computational efficiency. Results on various multimodal tasks demonstrate the efficiency and wide applicability of our approach. For instance, compared with static fusion approaches, DynMM reduces computation cost by 46.5% with only a negligible accuracy loss on CMU-MOSEI sentiment analysis, and improves segmentation performance with over 21% savings in computation on NYU Depth V2 semantic segmentation. We believe our approach opens a new direction towards dynamic multimodal network design, with applications to a wide range of multimodal tasks.
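To make the two ingredients named above concrete, the following is a minimal sketch in plain Python, not the implementation used in this work. It assumes a linear gate over the input features, a softmax over candidate branches, and a penalty equal to the gate-weighted expected cost scaled by a trade-off weight `lam`. The names `gate`, `resource_aware_loss`, `select_branch`, and all numeric values are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gate(features, weights):
    """Hypothetical linear gate: one logit per candidate branch.

    features: list of floats summarizing the multimodal input.
    weights:  one row of floats per branch (assumed, not from the paper).
    Returns a probability for each branch.
    """
    logits = [sum(w * x for w, x in zip(row, features)) for row in weights]
    return softmax(logits)


def resource_aware_loss(task_loss, gate_probs, branch_costs, lam):
    """Task loss plus lam times the expected computation cost.

    This form of the penalty is an assumption made for illustration.
    """
    expected_cost = sum(p * c for p, c in zip(gate_probs, branch_costs))
    return task_loss + lam * expected_cost


def select_branch(gate_probs):
    """Hard decision at inference: run only the most probable branch."""
    return max(range(len(gate_probs)), key=gate_probs.__getitem__)


# Illustrative setup: branch 0 is a cheap single-modality path,
# branch 1 is a more expensive full-fusion path.
features = [0.2, -1.0, 0.5]
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
costs = [1.0, 4.0]  # relative compute per branch (made-up units)

probs = gate(features, weights)
loss = resource_aware_loss(task_loss=0.7, gate_probs=probs,
                           branch_costs=costs, lam=0.1)
branch = select_branch(probs)
```

In this sketch, raising `lam` penalizes probability mass placed on the expensive branch, so training would push easy inputs toward the cheap path, while the hard selection at inference yields a data-dependent forward path.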