SpATr: MoCap 3D Human Action Recognition based on Spiral Auto-encoder and Transformer Network

Recent technological advancements have significantly expanded the potential of human action recognition by harnessing 3D data. Such data provides a richer understanding of actions, including depth information that enables more accurate analysis of spatial and temporal characteristics. In this context, we study the challenge of 3D human action recognition. Unlike prior methods that rely on sampling 2D depth images, skeleton points, or point clouds, which often leads to substantial memory requirements and limits them to short sequences, we introduce a novel approach for 3D human action recognition, denoted SpATr (Spiral Auto-encoder and Transformer Network), specifically designed for fixed-topology mesh sequences. The SpATr model disentangles space and time in the mesh sequences. A lightweight auto-encoder based on spiral convolutions extracts spatial geometric features from each 3D mesh. These convolutions are lightweight and specifically designed for fixed-topology mesh data. Subsequently, a temporal transformer based on self-attention captures the temporal context within the feature sequence. The self-attention mechanism captures long-range dependencies and supports parallel processing, ensuring scalability to long sequences. The proposed method is evaluated on three prominent 3D human action datasets from the Archive of Motion Capture As Surface Shapes (AMASS): Babel, MoVi, and BMLrub. Our analysis of the results demonstrates the competitive performance of SpATr in 3D human action recognition while maintaining efficient memory usage. The code and the training results will soon be made publicly available at https://github.com/h-bouzid/spatr.
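
To make the space-time disentanglement concrete, the following is a minimal NumPy sketch of the two stages described above: a spiral convolution applied to each mesh independently, followed by single-head self-attention over the resulting per-frame features. It is an illustration under stated assumptions, not the authors' implementation. The toy sizes, the random weights, the randomly generated spiral indices (real spirals are precomputed from the mesh connectivity), the mean pooling over vertices and frames, and the names `spiral_conv`, `encode_frame`, and `self_attention` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def spiral_conv(x, spirals, weight, bias):
    """Spiral convolution on one mesh.

    x:       (V, C_in) per-vertex features
    spirals: (V, L) vertex indices along each vertex's spiral
    weight:  (L * C_in, C_out) shared linear map; bias: (C_out,)
    """
    num_vertices = spirals.shape[0]
    # Gather the features along each spiral and concatenate them per vertex.
    gathered = x[spirals].reshape(num_vertices, -1)  # (V, L * C_in)
    return gathered @ weight + bias                  # (V, C_out)

def encode_frame(vertices, spirals, weight, bias):
    """Spatial stage: one spiral conv + ReLU, then mean pooling (assumption)."""
    hidden = np.maximum(spiral_conv(vertices, spirals, weight, bias), 0.0)
    return hidden.mean(axis=0)                       # (C_out,)

def self_attention(seq, w_q, w_k, w_v):
    """Temporal stage: scaled dot-product self-attention over T frames."""
    q, k, v = seq @ w_q, seq @ w_k, seq @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])          # (T, T)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Toy sizes (assumptions): frames, vertices, spiral length, feature dim, classes.
T, V, L, D, NUM_CLASSES = 8, 20, 5, 16, 4

meshes = rng.normal(size=(T, V, 3))                  # fixed-topology mesh sequence
# Placeholder spirals; the same indices are reused for every frame
# because the topology is fixed.
spirals = rng.integers(0, V, size=(V, L))

w_conv = rng.normal(scale=0.1, size=(L * 3, D))
b_conv = np.zeros(D)
w_q, w_k, w_v = (rng.normal(scale=0.1, size=(D, D)) for _ in range(3))
w_cls = rng.normal(scale=0.1, size=(D, NUM_CLASSES))

# Space: encode each mesh independently into one feature vector per frame.
frame_features = np.stack(
    [encode_frame(m, spirals, w_conv, b_conv) for m in meshes]
)                                                    # (T, D)

# Time: every frame attends to every other frame in a single matrix product.
context, attn = self_attention(frame_features, w_q, w_k, w_v)

# Pool over time and classify (linear head is an assumption).
logits = context.mean(axis=0) @ w_cls
predicted_class = int(np.argmax(logits))
```

Because the spiral indices are shared by all frames, the spatial stage keeps only one small feature vector per mesh, and the temporal stage operates on a `(T, D)` matrix rather than on raw geometry, which is the intuition behind the memory efficiency claimed above.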
