Thanks to the powerful generative capacity of diffusion models, recent years have witnessed rapid progress in human motion generation. However, existing diffusion-based methods employ disparate network architectures and training strategies, and the effect of each component's design remains unclear. In addition, the iterative denoising process incurs considerable computational overhead, which is prohibitive for real-time scenarios such as virtual characters and humanoid robots. For this reason, we first conduct a comprehensive investigation into network architectures, training strategies, and inference processes. Based on this analysis, we tailor each component for efficient, high-quality human motion generation. Despite its promising performance, the tailored model still suffers from foot skating, a ubiquitous issue in diffusion-based solutions. To eliminate foot skating, we identify foot-ground contact and correct foot motions along the denoising process. By combining these components, we present StableMoFusion, a robust and efficient framework for human motion generation. Extensive experimental results show that StableMoFusion performs favorably against current state-of-the-art methods. Project page: https://h-y1heng.github.io/StableMoFusion-page/