With the exponential growth of multimedia data, leveraging multimodal sensors presents a promising approach for improving accuracy in human activity recognition. Nevertheless, accurately identifying these activities from both video data and wearable sensor data remains challenging due to labor-intensive data annotation and reliance on external pretrained models or additional data. To address these challenges, we introduce Multimodal Masked Autoencoders-Based One-Shot Learning (Mu-MAE). Mu-MAE integrates a multimodal masked autoencoder with a synchronized masking strategy tailored for wearable sensors. This masking strategy compels the networks to capture more meaningful spatiotemporal features, enabling effective self-supervised pretraining without the need for external data. Furthermore, Mu-MAE uses the representations extracted by the multimodal masked autoencoder as prior information for a cross-attention multimodal fusion layer. This fusion layer emphasizes spatiotemporal features requiring attention across modalities while highlighting differences from other classes, aiding classification in metric-based one-shot learning. Comprehensive evaluations on MMAct one-shot classification show that Mu-MAE outperforms all evaluated approaches, achieving up to 80.17% accuracy for five-way one-shot multimodal classification without the use of additional data.
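To make the synchronized masking idea more concrete, the sketch below (PyTorch, with hypothetical tensor shapes and names; not the authors' implementation) hides the same randomly chosen time steps in every modality, so during pretraining the encoder must reconstruct the missing span from temporal context rather than copying it from a parallel, unmasked sensor stream.

```python
import torch

def synchronized_mask(token_seqs, mask_ratio=0.75):
    """Apply one shared temporal mask to every modality.

    token_seqs: list of tensors shaped (batch, time, dim), time-aligned across modalities.
    Returns the visible tokens per modality and the shared boolean mask (True = masked).
    """
    B, T, _ = token_seqs[0].shape
    num_keep = int(T * (1 - mask_ratio))
    # One random ordering of time steps per sample, shared by all modalities,
    # so the same instants are hidden in the video and every wearable-sensor stream.
    keep_idx = torch.rand(B, T).argsort(dim=1)[:, :num_keep]      # (B, num_keep)
    mask = torch.ones(B, T, dtype=torch.bool)
    mask[torch.arange(B).unsqueeze(1), keep_idx] = False          # False = visible
    visible = [seq.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, seq.size(-1)))
               for seq in token_seqs]
    return visible, mask

# Toy usage: video tokens and accelerometer tokens share the same masked time steps.
video_tokens = torch.randn(4, 16, 768)   # (batch, time, embed_dim)
accel_tokens = torch.randn(4, 16, 128)
(visible_video, visible_accel), mask = synchronized_mask([video_tokens, accel_tokens])
```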