This paper introduces OmniMotion-X, a versatile multimodal framework for whole-body human motion generation that uses an autoregressive diffusion transformer in a unified sequence-to-sequence manner. OmniMotion-X efficiently supports diverse multimodal tasks, including text-to-motion, music-to-dance, speech-to-gesture, and global spatio-temporal control scenarios (e.g., motion prediction, in-betweening, completion, and joint/trajectory-guided synthesis), as well as flexible combinations of these tasks. Specifically, we propose reference motion as a novel conditioning signal, which substantially improves the consistency of generated motion in content, style, and temporal dynamics, all of which are crucial for realistic animation. To resolve conflicts among conditioning modalities, we introduce a progressive weak-to-strong mixed-condition training strategy. To enable high-quality multimodal training, we construct OmniMoCap-X, the largest unified multimodal motion dataset to date, which integrates 28 publicly available MoCap sources across 10 distinct tasks and standardizes them to the SMPL-X format at 30 fps. To ensure detailed and consistent annotations, we render the sequences as videos and use GPT-4o to automatically generate structured, hierarchical captions that capture both low-level actions and high-level semantics. Extensive experiments show that OmniMotion-X outperforms existing methods, achieves state-of-the-art performance across multiple multimodal tasks, and enables interactive generation of realistic, coherent, and controllable long-duration motions.
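The abstract states that all MoCap sources are standardized to 30 fps but does not describe the resampling procedure. As an illustration only, the following minimal sketch shows one plausible way to bring a clip recorded at a different frame rate to 30 fps by linear interpolation. The function name `resample_to_fps` and the list-of-vectors data layout are assumptions made for this example, not part of OmniMoCap-X, and linear blending is only appropriate for positional channels (e.g., root translation or joint positions); rotation parameters would typically require spherical interpolation, which is not shown.

```python
from typing import List, Sequence


def resample_to_fps(frames: Sequence[Sequence[float]],
                    src_fps: float,
                    dst_fps: float = 30.0) -> List[List[float]]:
    """Linearly resample per-frame feature vectors from src_fps to dst_fps.

    Hypothetical helper for illustration; not the dataset's actual pipeline.
    Intended for positional channels only (rotations need slerp instead).
    """
    if src_fps <= 0 or dst_fps <= 0:
        raise ValueError("frame rates must be positive")
    n = len(frames)
    if n == 0:
        return []
    if n == 1:
        return [list(frames[0])]

    # Number of output frames that fit inside the clip's duration.
    n_out = int((n - 1) * dst_fps / src_fps + 1e-9) + 1

    out: List[List[float]] = []
    for k in range(n_out):
        pos = k * src_fps / dst_fps      # fractional index into the source
        i = min(int(pos), n - 2)         # left neighbour, clamped at the end
        w = pos - i                      # blend weight toward frames[i + 1]
        out.append([(1.0 - w) * a + w * b
                    for a, b in zip(frames[i], frames[i + 1])])
    return out


if __name__ == "__main__":
    # A 60 fps clip with one scalar channel, downsampled to 30 fps.
    clip_60 = [[0.0], [1.0], [2.0], [3.0], [4.0]]
    print(resample_to_fps(clip_60, src_fps=60.0))
```

In this sketch the output keeps the first frame and every sample that falls within the original clip duration, so downsampling never extrapolates beyond the last recorded frame.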