Hands are central to interacting with our surroundings and conveying gestures, making their inclusion essential for full-body motion synthesis. Despite this, existing human motion synthesis methods fall short: some ignore hand motions entirely, while others generate full-body motions only for narrowly scoped tasks under highly constrained settings. A key obstacle is the lack of large-scale datasets that jointly capture diverse full-body motion with detailed hand articulation. The few datasets that capture both are limited in scale and diversity, while large-scale datasets typically cover either body motion without hands or hand motion without the body. To overcome this, we curate and unify existing hand motion datasets with large-scale body motion data to produce full-body sequences that capture both hand and body motion. We then propose FUSION, the first diffusion-based unconditional full-body motion prior, which jointly models body and hand motion. Despite using a pose-based motion representation, FUSION surpasses state-of-the-art skeletal control models on the Keypoint Tracking task on the HumanML3D dataset and achieves superior motion naturalness. Beyond standard benchmarks, we demonstrate that FUSION extends beyond typical uses of motion priors through two applications: (1) generating detailed full-body motion, including fingers, during interaction given the motion of an object, and (2) generating self-interaction motions using an LLM to transform natural language cues into actionable motion constraints. For these applications, we develop an optimization pipeline that refines the latent space of our diffusion model to generate task-specific motions. Experiments on these tasks highlight precise control over hand motion while maintaining plausible full-body coordination. The code will be made public.