Standard multi-modal models assume that the same modalities are available in both the training and inference stages. In practice, however, the environment in which a multi-modal model operates may not satisfy this assumption, and performance degrades drastically if any modality is missing at inference time. We ask: how can we train a model that is robust to missing modalities? This paper seeks a set of good practices for multi-modal action recognition, with a particular interest in circumstances where some modalities are unavailable at inference time. First, we study how to effectively regularize the model during training (e.g., through data augmentation). Second, we investigate fusion methods with respect to robustness to missing modalities: we find that transformer-based fusion is more robust to missing modalities than summation or concatenation. Third, we propose a simple modular network, ActionMAE, which learns missing modality predictive coding by randomly dropping modality features and trying to reconstruct them from the remaining modality features. Combining these good practices, we build a model that is not only effective in multi-modal action recognition but also robust to missing modalities. Our model achieves state-of-the-art results on multiple benchmarks and maintains competitive performance even in missing-modality scenarios. Code is available at https://github.com/sangminwoo/ActionMAE.
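The drop-and-reconstruct idea behind ActionMAE can be illustrated with a minimal sketch. This is not the released implementation: the modality names (`rgb`, `depth`, `ir`), the feature values, the mean-based stand-in predictor, and the mean-squared-error objective are all assumptions made for illustration, whereas the abstract describes a learned network whose loss is not specified here.

```python
import random


def drop_one_modality(features, rng):
    """Randomly choose one modality to drop; return its name and the rest."""
    dropped = rng.choice(sorted(features))
    remaining = {name: vec for name, vec in features.items() if name != dropped}
    return dropped, remaining


def reconstruct_from_remaining(remaining):
    """Stand-in predictor (assumption): element-wise mean of remaining features.

    A learned module would replace this function; the mean only keeps the
    sketch runnable without a deep learning framework.
    """
    vectors = list(remaining.values())
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def reconstruction_loss(predicted, target):
    """Mean squared error between predicted and original dropped features."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


# Hypothetical per-modality feature vectors for a single sample.
features = {
    "rgb": [0.2, 0.4, 0.6],
    "depth": [0.1, 0.5, 0.7],
    "ir": [0.3, 0.3, 0.5],
}

rng = random.Random(0)  # seeded so that the dropped modality is reproducible
dropped, remaining = drop_one_modality(features, rng)
predicted = reconstruct_from_remaining(remaining)
loss = reconstruction_loss(predicted, features[dropped])
```

In a training loop, a loss of this kind would presumably be minimized together with the action classification loss, so that at inference time the features of an absent modality can be predicted from whichever modalities are present.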