Conventional methods for human motion synthesis are either deterministic or struggle with the trade-off between motion diversity and motion quality. In response to these limitations, we introduce MoFusion, a new denoising-diffusion-based framework for high-quality conditional human motion synthesis that can generate long, temporally plausible, and semantically accurate motions based on a range of conditioning contexts (such as music and text). We also present ways to introduce well-known kinematic losses for motion plausibility within the motion diffusion framework through our scheduled weighting strategy. The learned latent space can be used for several interactive motion editing applications, such as inbetweening, seed conditioning, and text-based editing, thus providing crucial abilities for virtual character animation and robotics. Through comprehensive quantitative evaluations and a perceptual user study, we demonstrate the effectiveness of MoFusion compared to the state of the art on established benchmarks in the literature. We urge the reader to watch our supplementary video and visit https://vcai.mpi-inf.mpg.de/projects/MoFusion.