Compared with visual signals, Inertial Measurement Units (IMUs) placed on human limbs can capture accurate motion signals while remaining robust to lighting variation and occlusion. Although these characteristics are intuitively valuable for egocentric action recognition, the potential of IMUs remains under-explored. In this work, we present a novel action recognition method that integrates motion data from body-worn IMUs with egocentric video. Because labeled multimodal data are scarce, we design an MAE-based self-supervised pretraining method that obtains strong multimodal representations by modeling the natural correlation between visual and motion signals. To capture the complex relations among multiple IMU devices placed across the body, we exploit their collaborative dynamics and propose embedding the relative motion features of human joints into a graph structure. Experiments show that our method achieves state-of-the-art performance on multiple public datasets. The effectiveness of our MAE-based pretraining and graph-based IMU modeling is further validated in more challenging scenarios, including partially missing IMU devices and video quality corruption, enabling more flexible use in the real world.
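To make the pretraining idea concrete, below is a minimal sketch of MAE-style masked reconstruction over a fused sequence of video and IMU tokens. It is not the paper's implementation: the module names, token dimensions, 75% mask ratio, single-layer decoder, and the choice to reconstruct token embeddings (rather than raw patches or raw IMU readings) are all illustrative assumptions.

```python
# Minimal sketch (not the authors' code): MAE-style pretraining over
# concatenated video and IMU tokens. Dimensions and mask ratio are
# illustrative assumptions.
import torch
import torch.nn as nn


class MultimodalMAE(nn.Module):
    def __init__(self, dim=256, depth=4, heads=8, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), 1)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, dim)  # reconstruct token embeddings

    def forward(self, video_tokens, imu_tokens):
        # Fuse both modalities into one token sequence so the encoder can
        # model the cross-modal correlation between visual and motion signals.
        x = torch.cat([video_tokens, imu_tokens], dim=1)  # (B, N, D)
        B, N, D = x.shape
        n_keep = int(N * (1 - self.mask_ratio))

        # Random per-sample masking: keep a subset of tokens, hide the rest.
        noise = torch.rand(B, N, device=x.device)
        ids_shuffle = noise.argsort(dim=1)
        ids_restore = ids_shuffle.argsort(dim=1)
        ids_keep = ids_shuffle[:, :n_keep]
        x_vis = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

        # Encode only visible tokens, pad with mask tokens, restore order,
        # then decode the full sequence.
        z = self.encoder(x_vis)
        pad = self.mask_token.expand(B, N - n_keep, D)
        z_full = torch.cat([z, pad], dim=1)
        z_full = torch.gather(
            z_full, 1, ids_restore.unsqueeze(-1).expand(-1, -1, D))
        recon = self.head(self.decoder(z_full))

        # Reconstruction loss is computed only on masked positions.
        mask = torch.ones(B, N, device=x.device)
        mask.scatter_(1, ids_keep, 0.0)
        loss = ((recon - x) ** 2).mean(dim=-1)
        return (loss * mask).sum() / mask.sum()
```

Because masked video tokens must be recovered partly from visible IMU tokens (and vice versa), minimizing the reconstruction loss pushes the shared encoder toward the visual-motion correlation the abstract describes, without any action labels.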
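The graph-based IMU modeling can likewise be sketched as a message-passing layer over a fully connected graph whose nodes are body-worn devices and whose edge features are pairwise relative motion. The 6-channel input (assumed to be 3-axis accelerometer plus 3-axis gyroscope), the edge MLP, and mean aggregation are assumptions for illustration, not the paper's architecture.

```python
# Minimal sketch (not the authors' code): embedding relative motion between
# body-worn IMUs into a fully connected graph. Feature sizes and the
# aggregation scheme are illustrative assumptions.
import torch
import torch.nn as nn


class IMURelationGraph(nn.Module):
    def __init__(self, in_dim=6, hid=64):  # e.g. 3-axis accel + 3-axis gyro
        super().__init__()
        self.node_proj = nn.Linear(in_dim, hid)
        # Edge features encode the *relative* motion between two devices,
        # capturing the collaborative dynamics of the joints they sit on.
        self.edge_mlp = nn.Sequential(
            nn.Linear(in_dim, hid), nn.ReLU(), nn.Linear(hid, hid))
        self.update = nn.Sequential(nn.Linear(2 * hid, hid), nn.ReLU())

    def forward(self, imu):                  # imu: (B, J, in_dim), J devices
        h = self.node_proj(imu)              # node embeddings (B, J, hid)
        # Relative motion between every ordered pair of joints (i, j).
        rel = imu.unsqueeze(2) - imu.unsqueeze(1)        # (B, J, J, in_dim)
        e = self.edge_mlp(rel)                           # (B, J, J, hid)
        # Aggregate messages from all neighbors (fully connected graph).
        msg = e.mean(dim=2)                              # (B, J, hid)
        return self.update(torch.cat([h, msg], dim=-1))  # (B, J, hid)
```

Representing devices as graph nodes rather than a flat concatenation is also what makes the missing-device scenario tractable: dropping an IMU removes a node and its edges, while message passing over the remaining pairs still produces embeddings for every present device.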