An Elastic Shape Variational Autoencoder for Skeleton Pose Trajectories

Deep generative models provide flexible frameworks for modeling complex, structured data such as images, videos, 3D objects, and text. However, when applied to sequences of human skeletons, standard variational autoencoders (VAEs) often allocate substantial capacity to nuisance factors (such as camera orientation, subject scale, viewpoint, and execution speed) rather than to the intrinsic geometry of shapes and their motion. We propose the Elastic Shape Variational Autoencoder (ES-VAE), a geometry-aware generative model for skeletal trajectories that leverages the transported square-root velocity field (TSRVF) representation on Kendall's shape manifold. This representation inherently removes rigid translations, rotations, and global scaling of shapes, as well as temporal rate variability of sequences, isolating the underlying shape dynamics. The ES-VAE encoder maps skeletal sequences to a low-dimensional latent space using the Riemannian logarithm map, while the decoder reconstructs sequences using the corresponding exponential map. We demonstrate the effectiveness of ES-VAE on two datasets. First, we analyze skeletal gait cycles to predict clinical mobility scores and to classify subjects into healthy and post-stroke groups. Second, we evaluate action recognition on the NTU RGB+D dataset. Across both settings, ES-VAE consistently outperforms standard VAEs and a range of sequence modeling baselines, including temporal convolutional networks, transformers, and graph convolutional networks. More broadly, ES-VAE provides a principled framework for learning generative models of longitudinal data on pose shape manifolds, offering improved latent representations and downstream performance compared to existing deep learning approaches.
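
As a rough illustration of the geometric ingredients named in the abstract, the sketch below removes translation, scale, and rotation from a single skeleton pose and then applies the logarithm and exponential maps on the unit sphere of centered, unit-norm configurations (the preshape sphere). It is a minimal NumPy sketch under stated assumptions, not the ES-VAE implementation: the function names `to_preshape`, `align_rotation`, `log_map`, and `exp_map` are hypothetical, the 25-joint 3D poses are random placeholders, and the TSRVF construction and temporal registration of full sequences are omitted.

```python
import numpy as np


def to_preshape(x):
    """Center a (k, d) landmark configuration and scale it to unit Frobenius norm.

    This removes translation and global scale.
    """
    x = np.asarray(x, dtype=float)
    x = x - x.mean(axis=0, keepdims=True)
    return x / np.linalg.norm(x)


def align_rotation(p, q):
    """Rotate preshape q onto preshape p (orthogonal Procrustes, rotations only)."""
    u, _, vt = np.linalg.svd(q.T @ p)
    signs = np.ones(u.shape[0])
    if np.linalg.det(u @ vt) < 0:
        signs[-1] = -1.0  # flip one axis so the result is a rotation, not a reflection
    return q @ (u * signs) @ vt


def log_map(p, q):
    """Tangent vector at p pointing toward q on the unit sphere.

    Assumes p and q are unit-norm and not antipodal.
    """
    cos_theta = np.clip(np.sum(p * q), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < 1e-12:
        return np.zeros_like(p)
    return (theta / np.sin(theta)) * (q - cos_theta * p)


def exp_map(p, v):
    """Point reached by following the geodesic from p with initial velocity v."""
    norm_v = np.linalg.norm(v)
    if norm_v < 1e-12:
        return p.copy()
    return np.cos(norm_v) * p + np.sin(norm_v) * (v / norm_v)


# Illustrative data only: two random 25-joint poses in 3D.
rng = np.random.default_rng(0)
pose_a = rng.normal(size=(25, 3))
pose_b = pose_a + 0.1 * rng.normal(size=(25, 3))

p = to_preshape(pose_a)
q = align_rotation(p, to_preshape(pose_b))
v = log_map(p, q)        # encoder-side step: manifold point -> tangent vector
q_rec = exp_map(p, v)    # decoder-side step: tangent vector -> manifold point
print(np.allclose(q_rec, q))
```

In this simplified setting, the tangent vector `v` lives in a flat vector space, which is what makes it a convenient input for a standard encoder network, and `exp_map` returns the output to the manifold. A translated, rescaled, and rotated copy of `pose_a` maps to the same aligned preshape `p`, which is the sense in which these nuisance factors are removed before any learning takes place.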
