Distilled Mid-Fusion Transformer Networks for Multi-Modal Human Activity Recognition

Human Activity Recognition is an important task in many human-computer collaborative scenarios and has various practical applications. Although uni-modal approaches have been extensively studied, they are sensitive to data quality and require modality-specific feature engineering, so they are not robust or effective enough for real-world deployment. By drawing on multiple sensors, Multi-modal Human Activity Recognition can exploit complementary information to build models that generalize well. While deep learning methods have shown promising results, their potential for extracting salient multi-modal spatial-temporal features and for better fusing complementary information has not been fully explored. Reducing the complexity of multi-modal approaches for edge deployment also remains an open problem. To address these issues, a knowledge distillation-based Multi-modal Mid-Fusion approach, DMFT, is proposed to perform informative feature extraction and fusion and thereby solve the Multi-modal Human Activity Recognition task efficiently. DMFT first encodes the multi-modal input data into a unified representation. The DMFT teacher model then applies an attentive multi-modal spatial-temporal transformer module that extracts the salient spatial-temporal features. A temporal mid-fusion module is also proposed to further fuse the temporal features. Knowledge distillation is then applied to transfer the learned representation from the teacher model to a simpler DMFT student model, which consists of a lite version of the multi-modal spatial-temporal transformer module, to produce the results. DMFT was evaluated on two public multi-modal human activity recognition datasets and compared against various state-of-the-art approaches. The experimental results demonstrate that the model achieves competitive performance in terms of effectiveness, scalability, and robustness.
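
To make the teacher-to-student transfer step concrete, the following is a minimal sketch of a soft-target distillation objective. The abstract does not specify the loss used by DMFT, so everything here is an assumption for illustration: the temperature-scaled KL formulation, the function names `softmax` and `distillation_loss`, and the hyperparameters `temperature` and `alpha` are hypothetical and not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of a soft-target term and a hard-label term.

    Assumed formulation (not confirmed by the source):
      alpha * T^2 * KL(teacher_T || student_T)
      + (1 - alpha) * cross_entropy(student, label)
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL divergence between the softened teacher and student distributions.
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # Standard cross-entropy of the student against the ground-truth class.
    hard = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * hard


# Hypothetical logits for a three-class activity problem.
teacher = [0.0, 1.0, 3.0]
aligned_student = [0.0, 1.0, 3.0]
misaligned_student = [3.0, 1.0, 0.0]
print(distillation_loss(aligned_student, teacher, label=2))
print(distillation_loss(misaligned_student, teacher, label=2))
```

In this sketch a student whose logits match the teacher's incurs only the hard-label term, while a student that disagrees with the teacher is penalized by both terms. In a real pipeline the logits would come from the teacher and the lite student networks, and the weighting between the two terms would be tuned per dataset.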
