In multi-modal action recognition, it is important to consider not only the complementary nature of different modalities but also the global action content. In this paper, we propose a novel network, the Modality Mixer (M-Mixer), which leverages complementary information across modalities and the temporal context of an action for multi-modal action recognition. We also introduce a simple yet effective recurrent unit, the Multi-modal Contextualization Unit (MCU), which is the core component of M-Mixer. The MCU temporally encodes a sequence of one modality (e.g., RGB) with the action content features of the other modalities (e.g., depth, IR). This process encourages M-Mixer to exploit global action content and to draw on complementary information from the other modalities. As a result, our method outperforms state-of-the-art methods on the NTU RGB+D 60, NTU RGB+D 120, and NW-UCLA datasets. Moreover, we demonstrate the effectiveness of M-Mixer through comprehensive ablation studies.
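To make the idea of encoding one modality's sequence with the action content of another more concrete, the following is a minimal illustrative sketch, not the MCU as defined in the paper. It assumes a GRU-style gating scheme, temporal mean pooling as the "action content" feature, and random weights; the names `ContextualizationCellSketch`, `action_content`, and all dimensions are hypothetical and chosen only for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def action_content(seq):
    # Assumption: summarize a (T, D) feature sequence by temporal mean pooling.
    return seq.mean(axis=0)


class ContextualizationCellSketch:
    """Hypothetical GRU-style cell (an assumption, not the paper's equations).

    Each step sees the current frame feature of one modality, a fixed
    content vector from another modality, and the previous hidden state.
    """

    def __init__(self, feat_dim, ctx_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = feat_dim + ctx_dim + hidden_dim
        self.hidden_dim = hidden_dim
        self.W_z = rng.normal(0.0, 0.1, (in_dim, hidden_dim))  # update gate
        self.W_r = rng.normal(0.0, 0.1, (in_dim, hidden_dim))  # reset gate
        self.W_h = rng.normal(0.0, 0.1, (in_dim, hidden_dim))  # candidate state

    def step(self, x_t, ctx, h_prev):
        gate_in = np.concatenate([x_t, ctx, h_prev])
        z = sigmoid(gate_in @ self.W_z)
        r = sigmoid(gate_in @ self.W_r)
        cand_in = np.concatenate([x_t, ctx, r * h_prev])
        h_cand = np.tanh(cand_in @ self.W_h)
        # Blend the previous state with the candidate state.
        return (1.0 - z) * h_prev + z * h_cand

    def encode(self, seq, ctx):
        h = np.zeros(self.hidden_dim)
        for x_t in seq:
            h = self.step(x_t, ctx, h)
        return h


# Toy features: 8 frames of 16-d "RGB" and 12-d "depth" features.
rng = np.random.default_rng(1)
rgb = rng.normal(size=(8, 16))
depth = rng.normal(size=(8, 12))

cell = ContextualizationCellSketch(feat_dim=16, ctx_dim=12, hidden_dim=32)
h_rgb = cell.encode(rgb, action_content(depth))
```

Here the RGB sequence is encoded step by step while the depth content vector is injected at every step, so the final state `h_rgb` mixes temporal context from one modality with a global summary of the other. A symmetric cell could encode the depth sequence with RGB content under the same assumptions.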