T2M-HiFiGPT: Generating High Quality Human Motion from Textual Descriptions with Residual Discrete Representations

In this study, we introduce T2M-HiFiGPT, a novel conditional generative framework for synthesizing human motion from textual descriptions. The framework is built on a Residual Vector Quantized Variational AutoEncoder (RVQ-VAE) and a double-tier Generative Pretrained Transformer (GPT) architecture. We demonstrate that our CNN-based RVQ-VAE produces highly accurate 2D temporal-residual discrete motion representations, that is, a grid of code indices with one axis for time and one for residual quantization depth. The proposed double-tier GPT comprises a temporal GPT and a residual GPT. The temporal GPT condenses information from previous frames and the textual description into a 1D context vector. This vector then serves as a context prompt for the residual GPT, which generates the final residual discrete indices. The RVQ-VAE decoder subsequently transforms these indices back into motion data. To mitigate exposure bias, we employ straightforward code corruption techniques for RVQ together with a conditional dropout strategy, both of which improve synthesis performance. T2M-HiFiGPT not only simplifies the generative process but also surpasses existing methods, including the latest diffusion-based and GPT-based models, in both performance and parameter efficiency. On the HumanML3D and KIT-ML datasets, our framework achieves exceptional results across nearly all primary metrics. We further validate the framework through comprehensive ablation studies on the HumanML3D dataset, examining the contribution of each component. Our findings show that RVQ-VAE captures 3D human motion more precisely than its VQ-VAE counterparts at comparable computational cost. As a result, T2M-HiFiGPT generates human motion with significantly higher accuracy, outperforming recent state-of-the-art approaches such as T2M-GPT and Att-T2M.
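To make the "2D temporal-residual discrete representation" concrete, the following is a minimal NumPy sketch of generic residual vector quantization, not the authors' implementation. It assumes the encoder has already produced a latent sequence of shape (T, D) and that each of the Q quantization levels has its own codebook of K entries. The function names (`rvq_encode`, `rvq_decode`, `corrupt_codes`) and all shapes are hypothetical. The abstract does not specify the corruption scheme, so `corrupt_codes` simply assumes that a random fraction of indices is replaced with uniformly sampled codes.

```python
import numpy as np


def rvq_encode(latents, codebooks):
    """Residual vector quantization (generic sketch, not the paper's code).

    latents:   (T, D) array, one latent vector per time step.
    codebooks: (Q, K, D) array, one codebook of K entries per level.
    Returns a (T, Q) integer grid: time along axis 0, residual depth along axis 1.
    """
    residual = latents.astype(float)
    num_levels = codebooks.shape[0]
    indices = np.empty((latents.shape[0], num_levels), dtype=np.int64)
    for q in range(num_levels):
        codebook = codebooks[q]
        # Squared Euclidean distance from every residual to every code: (T, K).
        dists = ((residual[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        indices[:, q] = dists.argmin(axis=1)
        # The next level quantizes whatever this level failed to explain.
        residual = residual - codebook[indices[:, q]]
    return indices


def rvq_decode(indices, codebooks):
    """Sum the selected code vectors over all levels to rebuild the latents."""
    recon = np.zeros((indices.shape[0], codebooks.shape[2]))
    for q in range(codebooks.shape[0]):
        recon += codebooks[q][indices[:, q]]
    return recon


def corrupt_codes(indices, num_codes, prob, rng):
    """Assumed corruption scheme: replace each index with a uniformly random
    code with probability `prob` (the abstract does not give the exact rule)."""
    mask = rng.random(indices.shape) < prob
    random_codes = rng.integers(0, num_codes, size=indices.shape)
    return np.where(mask, random_codes, indices)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latents = rng.normal(size=(16, 8))       # T=16 frames, D=8 dims (illustrative)
    codebooks = rng.normal(size=(4, 32, 8))  # Q=4 levels, K=32 codes (illustrative)
    codes = rvq_encode(latents, codebooks)
    noisy = corrupt_codes(codes, num_codes=32, prob=0.1, rng=rng)
    print(codes.shape, rvq_decode(noisy, codebooks).shape)
```

In this sketch each row of the index grid corresponds to one time step and each column to one residual level. This matches the division of labor the abstract describes: a temporal model moves along the rows, while a residual model fills in the columns for the current step.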
