Effective human behavior modeling requires a representation of human body movement that capitalizes on its compositionality. We propose a hierarchical representation consisting of Action Atoms, which capture atomic joint movements, and Action Motifs, which are formed by temporal compositions of Action Atoms and encode similar body movements shared across different overall human actions. We derive A4Mer, a nested latent Transformer that learns this hierarchical representation from human pose data in a fully self-supervised manner. A4Mer splits a 3D pose sequence into variable-length segments and represents each segment as a single latent token (an Action Atom). Through bottom-up representation learning, temporal patterns composed of these Action Atoms naturally emerge as Action Motifs, which capture meaningful temporal spans of reusable, semantic segments of body movement. A4Mer achieves this with a unified pretext task of masked token prediction in the respective latent spaces of Action Atoms and Action Motifs. We also introduce the Action Motif Dataset (AMD), a large-scale dataset of multi-view human behavior videos with full SMPL annotations. To obtain frame-wise annotations despite frequent and heavy body occlusions, we introduce a novel use of cameras by mounting them on the feet. Experimental results demonstrate the effectiveness of A4Mer for extracting meaningful Action Motifs, which significantly benefit human behavior modeling tasks including action recognition, motion prediction, and motion interpolation.
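To make the two ingredients named in the abstract concrete (one token per variable-length segment, then masked token prediction over those tokens), the following is a minimal sketch and not A4Mer itself. It assumes mean pooling as a stand-in for the learned segment encoder, hand-picked segment boundaries, zero-valued mask tokens, and a mean-squared-error objective; the function names, the 24-joint layout, and the mask ratio are all hypothetical choices made for illustration. No Transformer predictor is included.

```python
import numpy as np


def segment_tokens(poses, boundaries):
    """Mean-pool each variable-length segment of a pose sequence into one token.

    poses: (T, J, 3) array of 3D joint positions.
    boundaries: sorted frame indices at which a new segment starts (0 excluded).
    Mean pooling is an assumption standing in for a learned encoder.
    """
    flat = poses.reshape(poses.shape[0], -1)      # (T, J*3)
    segments = np.split(flat, boundaries)         # list of (len_i, J*3) arrays
    return np.stack([seg.mean(axis=0) for seg in segments])  # (S, J*3)


def mask_tokens(tokens, mask_ratio, rng):
    """Replace a random subset of tokens with zeros; return masked copy and mask."""
    n = tokens.shape[0]
    n_mask = max(1, int(round(mask_ratio * n)))
    idx = rng.choice(n, size=n_mask, replace=False)
    mask = np.zeros(n, dtype=bool)
    mask[idx] = True
    masked = tokens.copy()
    masked[mask] = 0.0
    return masked, mask


def masked_prediction_loss(pred, target, mask):
    """Mean squared error computed on the masked positions only."""
    diff = pred[mask] - target[mask]
    return float(np.mean(diff ** 2))


rng = np.random.default_rng(0)
poses = rng.normal(size=(20, 24, 3))                   # 20 frames, 24 joints (hypothetical)
atoms = segment_tokens(poses, boundaries=[4, 9, 15])   # segments of 4, 5, 6, 5 frames
masked_atoms, mask = mask_tokens(atoms, mask_ratio=0.25, rng=rng)

# A real model would predict the masked tokens from the visible ones;
# here the masked input is scored directly, just to exercise the loss.
loss = masked_prediction_loss(masked_atoms, atoms, mask)
```

Under the same assumptions, the second level of the hierarchy could be sketched by calling `segment_tokens` and `mask_tokens` again on the atom tokens, although how A4Mer actually nests the two levels is not specified in this abstract.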