We study the problem of human action recognition using motion capture (MoCap) sequences. Unlike existing techniques, which require multiple manual steps to derive standardized skeleton representations as model input, we propose a novel Spatial-Temporal Mesh Transformer (STMT) that models the mesh sequences directly. The model uses a hierarchical transformer with intra-frame offset attention and inter-frame self-attention. The attention mechanism allows the model to attend freely between any two vertex patches and thereby learn non-local relationships in the spatial-temporal domain. Masked vertex modeling and future frame prediction are used as two self-supervised tasks to fully activate the bi-directional and auto-regressive attention in our hierarchical transformer. The proposed method achieves state-of-the-art performance compared to skeleton-based and point-cloud-based models on common MoCap benchmarks. Code is available at https://github.com/zgzxy001/STMT.
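To make the intra-frame offset attention more concrete, the following is a minimal sketch of one common formulation, in which the layer transforms the difference between the input features and their self-attention output and then adds a residual connection. It is an assumption that STMT follows this formulation: the abstract does not spell out the layer, and the released code is the authoritative reference. The function names, the single-head design, the plain softmax normalization, and the omission of batch normalization are all illustrative simplifications.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over the rows of x."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = [[s / scale for s in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), v)


def offset_attention(x, wq, wk, wv, wo):
    """Hypothetical offset-attention layer: ReLU((x - attn(x)) @ wo) + x.

    Assumption: the offset (input minus attention output) is passed through
    a linear map and ReLU, then added back to the input as a residual.
    """
    attn = self_attention(x, wq, wk, wv)
    offset = [[xi - ai for xi, ai in zip(xr, ar)] for xr, ar in zip(x, attn)]
    proj = [[max(0.0, v) for v in row] for row in matmul(offset, wo)]
    return [[xi + pi for xi, pi in zip(xr, pr)] for xr, pr in zip(x, proj)]


def random_matrix(rows, cols, rng):
    """Small random matrix standing in for learned weights or patch features."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Toy input: 4 vertex patches from one frame, each with an 8-dimensional feature.
rng = random.Random(0)
patches = random_matrix(4, 8, rng)
weights = [random_matrix(8, 8, rng) for _ in range(4)]
out = offset_attention(patches, *weights)
```

Because every patch attends to every other patch in the frame, the output for one patch can depend on arbitrarily distant patches, which is the non-local behavior the abstract describes. The output keeps the input shape, so layers of this kind can be stacked.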