VQ-HPS: Human Pose and Shape Estimation in a Vector-Quantized Latent Space

Human Pose and Shape Estimation (HPSE) from RGB images falls into two broad categories: parametric and non-parametric approaches. Parametric methods rely on a low-dimensional statistical body model to produce realistic results, whereas recent non-parametric methods achieve higher precision by directly regressing the 3D coordinates of the human body. Both have limitations: the parameters of statistical body models are difficult regression targets, and predicting 3D coordinates raises computational complexity and smoothness issues. In this work, we introduce a low-dimensional discrete latent representation of the human mesh and frame HPSE as a classification task. Instead of predicting body model parameters or 3D vertex coordinates, we predict this discrete latent representation, which can be decoded into a registered human mesh. This paradigm offers two key advantages: first, predicting a low-dimensional discrete representation confines our predictions to the space of anthropomorphic poses and shapes; second, framing the problem as a classification task lets us exploit the discriminative power of neural networks. Our proposed model, VQ-HPS, is a transformer-based architecture that predicts the discrete latent representation of the mesh and is trained by minimizing a cross-entropy loss. Our results show that VQ-HPS outperforms current state-of-the-art non-parametric approaches while yielding results as realistic as those produced by parametric methods, which highlights the potential of the classification approach for HPSE.
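To make the classification framing concrete, here is a minimal pure-Python sketch, not the VQ-HPS implementation. It assumes the standard vector-quantization lookup (each continuous latent is replaced by the index of its nearest codebook entry) and shows how those indices can serve as targets for a cross-entropy loss. The toy codebook, latents, logits, and the helper names `quantize` and `cross_entropy` are all hypothetical; the abstract does not specify codebook size, latent dimension, or network details.

```python
import math

def quantize(latents, codebook):
    """Return, for each latent vector, the index of its nearest codebook entry
    (squared Euclidean distance). Assumed here; details may differ in VQ-HPS."""
    indices = []
    for z in latents:
        dists = [sum((a - b) ** 2 for a, b in zip(z, e)) for e in codebook]
        indices.append(dists.index(min(dists)))
    return indices

def cross_entropy(logits, targets):
    """Mean cross-entropy between per-token logits and target class indices."""
    total = 0.0
    for row, t in zip(logits, targets):
        m = max(row)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_norm - row[t]
    return total / len(targets)

# Hypothetical toy setup: 4 codebook entries of dimension 2, 3 latent tokens.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
latents = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.1]]

# Discrete targets: one codebook index per latent token.
targets = quantize(latents, codebook)

# Stand-in for network outputs: one row of class scores per token.
logits = [[0.0, 4.0, 0.0, 0.0],
          [0.0, 0.0, 4.0, 0.0],
          [4.0, 0.0, 0.0, 0.0]]
loss = cross_entropy(logits, targets)

# At inference, the arg-max indices select codebook entries, which a decoder
# (not shown) would map to a mesh.
predicted = [row.index(max(row)) for row in logits]
decoded_latents = [codebook[i] for i in predicted]
```

Because every prediction is an index into a fixed codebook, the decoder only ever receives latent vectors it saw during training, which is one way to read the claim that predictions stay within the space of plausible poses and shapes.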
