SpATr: MoCap 3D Human Action Recognition based on Spiral Auto-encoder and Transformer Network

Recent advancements in technology have expanded the possibilities of human action recognition by leveraging 3D data, which offers a richer representation of actions through the inclusion of depth information, enabling more accurate analysis of spatial and temporal characteristics. However, 3D human action recognition is a challenging task due to the irregularity and disarrangement of the data points in action sequences. In this context, we present SpATr, a novel model for human action recognition from fixed-topology mesh sequences based on a Spiral Auto-encoder and a Transformer Network. The proposed method first disentangles space and time in the mesh sequences. Then, an auto-encoder is used to extract spatial geometrical features, and a tiny transformer is used to capture the temporal evolution of the sequence. Previous methods either use 2D depth images or sampled skeleton points, or they require a huge amount of memory, which limits them to short sequences. In this work, we show a competitive recognition rate and high memory efficiency by building our auto-encoder on spiral convolutions, which are lightweight convolutions applied directly to mesh data with fixed topologies, and by modeling temporal evolution using attention, which can handle long sequences. The proposed method is evaluated on two 3D human action datasets: MoVi and BMLrub from the Archive of Motion Capture As Surface Shapes (AMASS). Analysis of the results shows the effectiveness of our method in 3D human action recognition while maintaining high memory efficiency. The code will soon be made publicly available.
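
To make the idea of a spiral convolution on a fixed-topology mesh more concrete, the following is a minimal sketch and not the authors' implementation. It assumes the commonly used formulation in which each vertex has a precomputed, fixed-length spiral of vertex indices (the vertex itself first, then its neighbours in spiral order), the features along that spiral are concatenated, and one shared linear layer is applied. The function names `spiral_conv` and `encode_sequence`, the toy mesh, and the weights are all hypothetical; the abstract does not specify layer sizes, activations, or pooling, so none are shown.

```python
def spiral_conv(features, spirals, weight, bias):
    """Apply one spiral convolution to a single mesh frame (illustrative only).

    features: list of V per-vertex feature lists, each of length C_in
    spirals:  list of V index lists, each of length L (assumed precomputed
              once, which is possible because the mesh topology is fixed)
    weight:   list of C_out rows, each of length L * C_in
    bias:     list of C_out floats
    Returns a list of V output feature lists, each of length C_out.
    """
    out = []
    for spiral in spirals:
        # Concatenate the features of the vertices visited along the spiral.
        gathered = [x for idx in spiral for x in features[idx]]
        # Shared linear layer applied to the concatenated vector.
        out.append([
            sum(w * x for w, x in zip(row, gathered)) + b
            for row, b in zip(weight, bias)
        ])
    return out


def encode_sequence(frames, spirals, weight, bias):
    """Process every frame independently, leaving the time axis untouched."""
    return [spiral_conv(frame, spirals, weight, bias) for frame in frames]


# Toy mesh: 4 vertices on a ring, 1 feature per vertex, spiral length 3.
spirals = [[0, 1, 3], [1, 2, 0], [2, 3, 1], [3, 0, 2]]
frame = [[1.0], [2.0], [3.0], [4.0]]
weight = [[1.0, 1.0, 1.0]]  # one output channel that sums along the spiral
bias = [0.0]

result = spiral_conv(frame, spirals, weight, bias)
print(result)  # [[7.0], [6.0], [9.0], [8.0]]
```

Because the spatial step treats frames independently, the per-frame outputs form a sequence that a temporal model such as an attention-based network could consume afterwards; that temporal stage is not sketched here.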
