Human activity recognition (HAR) plays an increasingly important role in domains such as healthcare, security monitoring, and metaverse gaming. Although many vision-based HAR methods achieve strong performance, they remain brittle under adverse visual conditions, particularly low illumination, which motivates WiFi-based HAR as a complementary modality. Existing solutions that combine WiFi and vision rely on large amounts of labeled data, which are cumbersome to collect. In this paper, we propose MaskFi, a novel unsupervised multimodal HAR solution that uses only unlabeled video and WiFi activity data for model training. We propose a new algorithm, masked WiFi-vision modeling (MI2M), that enables the model to learn cross-modal and single-modal features by predicting masked sections during representation learning. Thanks to this unsupervised learning procedure, the network requires only a small amount of annotated data for fine-tuning and adapts to new environments with better performance. We conduct extensive experiments on two WiFi-vision datasets collected in-house; the results show that our method is both robust and accurate on human activity recognition and human identification.
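To make the idea of predicting masked sections concrete, the following is a minimal, generic sketch of masked joint modeling over two token streams. It is not the MI2M implementation: the token shapes, the 0.5 mask ratio, the helper names `mask_tokens` and `masked_reconstruction_loss`, and the mean-of-visible-tokens predictor are all assumptions chosen for illustration. In practice a learned encoder would replace the placeholder predictor.

```python
import numpy as np

def mask_tokens(tokens, mask_ratio, rng):
    """Zero out a random subset of token rows.

    Returns the masked copy and a boolean vector marking masked rows.
    """
    n = tokens.shape[0]
    n_masked = int(round(mask_ratio * n))
    mask = np.zeros(n, dtype=bool)
    mask[rng.permutation(n)[:n_masked]] = True
    masked = tokens.copy()
    masked[mask] = 0.0
    return masked, mask

def masked_reconstruction_loss(pred, target, mask):
    """Mean squared error computed on masked positions only."""
    if not mask.any():
        return 0.0
    diff = pred[mask] - target[mask]
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)

# Hypothetical inputs: 16 video patch embeddings and 12 WiFi segment
# embeddings, both already projected to a shared 8-dimensional space.
video_tokens = rng.normal(size=(16, 8))
wifi_tokens = rng.normal(size=(12, 8))

# Concatenate both modalities so masking can hide tokens from either one,
# forcing a predictor to use single-modal and cross-modal context.
joint = np.concatenate([video_tokens, wifi_tokens], axis=0)
masked_joint, mask = mask_tokens(joint, mask_ratio=0.5, rng=rng)

# Placeholder predictor (assumption): fill every masked row with the mean
# of the visible rows. A trained network would go here instead.
visible_mean = masked_joint[~mask].mean(axis=0)
pred = masked_joint.copy()
pred[mask] = visible_mean

loss = masked_reconstruction_loss(pred, joint, mask)
print(f"masked tokens: {int(mask.sum())} of {len(mask)}, loss: {loss:.4f}")
```

Because the loss is evaluated only on hidden rows, no labels are involved: the training signal comes entirely from the unlabeled video and WiFi streams.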