3D human motion generation is crucial for the creative industry. Recent advances rely on generative models with domain knowledge for text-driven motion generation, leading to substantial progress in capturing common motions. However, performance on more diverse motions remains unsatisfactory. In this work, we propose ReMoDiffuse, a diffusion-model-based motion generation framework that integrates a retrieval mechanism to refine the denoising process. ReMoDiffuse enhances the generalizability and diversity of text-driven motion generation with three key designs: 1) Hybrid Retrieval finds appropriate references in the database based on both semantic and kinematic similarity. 2) Semantic-Modulated Transformer selectively absorbs retrieved knowledge, adapting to the differences between the retrieved samples and the target motion sequence. 3) Condition Mixture makes better use of the retrieval database during inference, overcoming the scale sensitivity of classifier-free guidance. Extensive experiments demonstrate that ReMoDiffuse outperforms state-of-the-art methods by balancing text-motion consistency and motion quality, especially when generating more diverse motions.
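To make the idea of ranking references by both semantic and kinematic similarity concrete, the following is a minimal sketch and not the ReMoDiffuse implementation. It assumes that semantic similarity is the cosine similarity between text embeddings and that kinematic similarity can be approximated by how close the motion lengths are; the names `hybrid_score`, `retrieve`, `lam`, and the toy `DATABASE` are hypothetical and do not come from the abstract.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_score(query_emb, query_len, entry_emb, entry_len, lam=1.0):
    """Combine a semantic term and a kinematic term into one score.

    Assumption: the kinematic term decays exponentially with the relative
    gap between motion lengths (in frames); lengths must be positive.
    """
    semantic = cosine(query_emb, entry_emb)
    rel_gap = abs(entry_len - query_len) / max(entry_len, query_len)
    kinematic = math.exp(-lam * rel_gap)
    return semantic * kinematic


def retrieve(query_emb, query_len, database, k=2, lam=1.0):
    """Return the names of the k database entries with the highest hybrid score."""
    scored = [
        (hybrid_score(query_emb, query_len, e["emb"], e["length"], lam), e["name"])
        for e in database
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]


# Toy database: hypothetical 2-D text embeddings and motion lengths in frames.
DATABASE = [
    {"name": "walk_short", "emb": [1.0, 0.0], "length": 60},
    {"name": "walk_long", "emb": [1.0, 0.0], "length": 240},
    {"name": "jump", "emb": [0.0, 1.0], "length": 60},
]

top = retrieve([1.0, 0.1], 60, DATABASE, k=2)
```

In this toy setup the two "walk" entries have identical embeddings, so the length term alone decides their order, while the semantically unrelated "jump" entry ranks last even though its length matches the query exactly.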