Compared with visual signals, Inertial Measurement Units (IMUs) placed on human limbs can capture accurate motion signals while remaining robust to lighting variation and occlusion. Although these characteristics are intuitively valuable for egocentric action recognition, the potential of IMUs remains under-explored. In this work, we present a novel action recognition method that integrates motion data from body-worn IMUs with egocentric video. Because labeled multimodal data are scarce, we design an MAE-based self-supervised pretraining method that obtains strong multimodal representations by modeling the natural correlation between visual and motion signals. To capture the complex relations among multiple IMU devices placed across the body, we exploit their collaborative dynamics and propose embedding the relative motion features of human joints into a graph structure. Experiments show that our method achieves state-of-the-art performance on multiple public datasets. The effectiveness of our MAE-based pretraining and graph-based IMU modeling is further validated in more challenging scenarios, including partially missing IMU devices and video quality corruption, enabling more flexible use in the real world.
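To make the pretraining idea concrete, below is a minimal sketch of MAE-style masked reconstruction over a fused sequence of video and IMU tokens. It is not the paper's implementation: the module names, token dimensions, 75% mask ratio, single-layer decoder, and the choice to reconstruct token embeddings (rather than raw patches or raw IMU readings) are all illustrative assumptions.

```python
# Minimal sketch (not the authors' code): MAE-style pretraining over
# concatenated video and IMU tokens. Dimensions and mask ratio are
# illustrative assumptions.
import torch
import torch.nn as nn


class MultimodalMAE(nn.Module):
    def __init__(self, dim=256, depth=4, heads=8, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), 1)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, dim)  # reconstruct token embeddings

    def forward(self, video_tokens, imu_tokens):
        # Fuse both modalities into one token sequence so the encoder can
        # model the cross-modal correlation between visual and motion signals.
        x = torch.cat([video_tokens, imu_tokens], dim=1)  # (B, N, D)
        B, N, D = x.shape
        n_keep = int(N * (1 - self.mask_ratio))

        # Random per-sample masking: keep a subset of tokens, hide the rest.
        noise = torch.rand(B, N, device=x.device)
        ids_shuffle = noise.argsort(dim=1)
        ids_restore = ids_shuffle.argsort(dim=1)
        ids_keep = ids_shuffle[:, :n_keep]
        x_vis = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

        # Encode only visible tokens, pad with mask tokens, restore order,
        # then decode the full sequence.
        z = self.encoder(x_vis)
        pad = self.mask_token.expand(B, N - n_keep, D)
        z_full = torch.cat([z, pad], dim=1)
        z_full = torch.gather(
            z_full, 1, ids_restore.unsqueeze(-1).expand(-1, -1, D))
        recon = self.head(self.decoder(z_full))

        # Reconstruction loss is computed only on masked positions.
        mask = torch.ones(B, N, device=x.device)
        mask.scatter_(1, ids_keep, 0.0)
        loss = ((recon - x) ** 2).mean(dim=-1)
        return (loss * mask).sum() / mask.sum()
```

Because masked video tokens must be recovered partly from visible IMU tokens (and vice versa), minimizing the reconstruction loss pushes the shared encoder toward the visual-motion correlation the abstract describes, without any action labels.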
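The graph-based IMU modeling can likewise be sketched as a message-passing layer over a fully connected graph whose nodes are body-worn devices and whose edge features are pairwise relative motion. The 6-channel input (assumed to be 3-axis accelerometer plus 3-axis gyroscope), the edge MLP, and mean aggregation are assumptions for illustration, not the paper's architecture.

```python
# Minimal sketch (not the authors' code): embedding relative motion between
# body-worn IMUs into a fully connected graph. Feature sizes and the
# aggregation scheme are illustrative assumptions.
import torch
import torch.nn as nn


class IMURelationGraph(nn.Module):
    def __init__(self, in_dim=6, hid=64):  # e.g. 3-axis accel + 3-axis gyro
        super().__init__()
        self.node_proj = nn.Linear(in_dim, hid)
        # Edge features encode the *relative* motion between two devices,
        # capturing the collaborative dynamics of the joints they sit on.
        self.edge_mlp = nn.Sequential(
            nn.Linear(in_dim, hid), nn.ReLU(), nn.Linear(hid, hid))
        self.update = nn.Sequential(nn.Linear(2 * hid, hid), nn.ReLU())

    def forward(self, imu):                  # imu: (B, J, in_dim), J devices
        h = self.node_proj(imu)              # node embeddings (B, J, hid)
        # Relative motion between every ordered pair of joints (i, j).
        rel = imu.unsqueeze(2) - imu.unsqueeze(1)        # (B, J, J, in_dim)
        e = self.edge_mlp(rel)                           # (B, J, J, hid)
        # Aggregate messages from all neighbors (fully connected graph).
        msg = e.mean(dim=2)                              # (B, J, hid)
        return self.update(torch.cat([h, msg], dim=-1))  # (B, J, hid)
```

Representing devices as graph nodes rather than a flat concatenation is also what makes the missing-device scenario tractable: dropping an IMU removes a node and its edges, while message passing over the remaining pairs still produces embeddings for every present device.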