Diffusion models have been widely adopted for text-driven human motion generation and related tasks because of their strong generative capabilities and flexibility. However, current motion diffusion models face two major limitations: a representational gap caused by pre-trained text encoders that lack motion-specific information, and error propagation during the iterative denoising process. This paper introduces the Reconstruction-Anchored Diffusion Model (RAM) to address these challenges. First, RAM uses a motion latent space as intermediate supervision for text-to-motion generation. To this end, RAM co-trains a motion reconstruction branch with two key objectives: self-regularization, which makes the motion latent space more discriminative, and motion-centric latent alignment, which enables an accurate mapping from text to the motion latent space. Second, RAM adds Reconstructive Error Guidance (REG), a test-time guidance mechanism that exploits the diffusion model's inherent self-correction ability to mitigate error propagation. At each denoising step, REG uses the motion reconstruction branch to reconstruct the previous estimate, reproducing its error patterns. By amplifying the residual between the current prediction and this reconstructed estimate, REG emphasizes the improvements made in the current prediction. Extensive experiments demonstrate that RAM achieves significant improvements and state-of-the-art performance. Our code will be released.
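The REG step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact update rule, so the extrapolation form used here (current prediction plus a scaled residual) is an assumption, as are the names `reg_guidance`, `identity_reconstruct`, and `scale`. The reconstruction branch is replaced by a hypothetical stand-in function; in RAM it would be the co-trained motion reconstruction branch.

```python
from typing import Callable, List

Vector = List[float]


def reg_guidance(
    current_pred: Vector,
    previous_pred: Vector,
    reconstruct: Callable[[Vector], Vector],
    scale: float,
) -> Vector:
    """Amplify the residual between the current prediction and the
    reconstruction of the previous estimate.

    Assumed form (not taken from the paper):
        guided = current + scale * (current - reconstruct(previous))
    """
    # Reconstruct the previous estimate to reproduce its error patterns.
    reconstructed = reconstruct(previous_pred)
    # Push the current prediction further along the direction in which it
    # differs from the reconstructed previous estimate.
    return [
        cur + scale * (cur - rec)
        for cur, rec in zip(current_pred, reconstructed)
    ]


def identity_reconstruct(x: Vector) -> Vector:
    # Hypothetical stand-in for the motion reconstruction branch.
    return list(x)


if __name__ == "__main__":
    guided = reg_guidance([1.0, 2.0], [0.5, 2.0], identity_reconstruct, 2.0)
    print(guided)
```

With `scale` set to 0 the function returns the current prediction unchanged, so under this assumed form the guidance strength acts as a single knob controlling how strongly the residual is amplified.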