People perceive the world through different senses, such as sight, hearing, smell, and touch. Processing and fusing information from multiple modalities helps artificial intelligence systems understand the world more easily. However, when modalities are missing, the number of available modalities varies from one situation to another, which leads to an N-to-One fusion problem. To solve this problem, we propose a self-attention based fusion block called SFusion. Unlike preset formulations or convolution-based methods, the proposed block automatically learns to fuse the available modalities without synthesizing or zero-padding the missing ones. Specifically, the feature representations extracted by the upstream processing model are projected as tokens and fed into a self-attention module to generate latent multimodal correlations. Then, a modal attention mechanism is introduced to build a shared representation, which can be used by the downstream decision model. The proposed SFusion can be easily integrated into existing multimodal analysis networks. In this work, we apply SFusion to different backbone networks for human activity recognition and brain tumor segmentation tasks. Extensive experimental results show that the SFusion block achieves better performance than the competing fusion strategies. Our code is available at https://github.com/scut-cszcl/SFusion.
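The N-to-One pipeline described in the abstract (tokens from the available modalities, self-attention over them, then modal attention to form one shared representation) can be sketched as follows. This is a minimal, dependency-free illustration under stated assumptions, not the released implementation: the function names `softmax`, `self_attention`, and `fuse` are hypothetical, the self-attention uses identity query/key/value projections where a trained block would learn them, and the modal attention is assumed here to be an element-wise softmax across modalities.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: identity Q/K/V projections (no learned weights).
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


def fuse(features):
    """Fuse {modality_name: list of token vectors} into one shared representation.

    Missing modalities are simply absent from the dict: nothing is
    synthesized or zero-padded. The output shape depends only on the
    per-modality token count and token dimension, not on how many
    modalities are present.
    """
    if not features:
        raise ValueError("at least one modality is required")
    names = sorted(features)
    n_tokens = len(features[names[0]])
    if any(len(features[name]) != n_tokens for name in names):
        raise ValueError("all modalities must provide the same number of tokens")

    # 1) Concatenate the tokens of all available modalities into one sequence.
    sequence = [tok for name in names for tok in features[name]]
    dim = len(sequence[0])

    # 2) Self-attention mixes information across tokens and modalities.
    attended = self_attention(sequence)

    # 3) Split the attended sequence back into one chunk per modality.
    per_modality = [attended[i * n_tokens:(i + 1) * n_tokens]
                    for i in range(len(names))]

    # 4) Modal attention (assumed form): softmax across modalities at each
    #    position and channel, followed by a weighted sum.
    fused = []
    for t in range(n_tokens):
        row = []
        for j in range(dim):
            values = [chunk[t][j] for chunk in per_modality]
            weights = softmax(values)
            row.append(sum(w * v for w, v in zip(weights, values)))
        fused.append(row)
    return fused
```

Calling `fuse` with one, two, or three modalities returns a result of the same shape each time, which is the property a downstream decision model would rely on. With a single modality the modal-attention weights collapse to 1, so the output equals the self-attended tokens of that modality.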