Synthesizing human motion in 3D environments, particularly for complex activities such as locomotion, hand-reaching, and human-object interaction, typically demands user-defined waypoints and stage transitions. These requirements challenge current models and leave a notable gap in automating character animation from simple human inputs. This paper addresses this challenge by introducing a comprehensive framework that synthesizes multi-stage, scene-aware interaction motions directly from a single text instruction and a goal location. Our approach employs an auto-regressive diffusion model to synthesize each successive motion segment, together with an autonomous scheduler that predicts the transition for each action stage. To ensure that the synthesized motions integrate seamlessly with the environment, we propose a scene representation that captures local perception at both the start and the goal location. We further enhance the coherence of the generated motion by integrating frame embeddings with the language input. Additionally, to support model training, we present a comprehensive motion-captured dataset comprising 16 hours of motion sequences across 120 indoor scenes, covering 40 motion types, each annotated with a precise language description. Experimental results demonstrate the efficacy of our method in generating high-quality, multi-stage motions closely aligned with environmental and textual conditions.
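The generation loop described above (an auto-regressive diffusion model producing the next motion segment, with a scheduler deciding when each action stage ends) can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: `denoise_segment`, `predict_transition`, the segment length, and the pose dimensionality are all assumed placeholder names and values.

```python
import numpy as np

rng = np.random.default_rng(0)

SEG_LEN, JOINT_DIM = 8, 6  # frames per segment, flattened pose size (toy values)

def denoise_segment(prev_segment, text_emb, scene_feat):
    """Toy stand-in for the diffusion sampler: returns the next motion
    segment, conditioned on the previous segment's last frame for
    continuity, plus text and local-scene features."""
    noise = rng.normal(scale=0.1, size=(SEG_LEN, JOINT_DIM))
    return prev_segment[-1] + np.cumsum(noise, axis=0) + 0.01 * (text_emb + scene_feat)

def predict_transition(segment, goal):
    """Toy stand-in for the autonomous scheduler: signals the end of the
    current stage once the root position nears the goal location."""
    return np.linalg.norm(segment[-1, :3] - goal) < 0.5

def synthesize(stages, goal, max_segments=32):
    """Auto-regressively generate segments, stage by stage, until the
    scheduler predicts each stage transition (or a segment cap is hit)."""
    motion = [np.zeros((SEG_LEN, JOINT_DIM))]  # seed segment (rest pose)
    for text_emb, scene_feat in stages:
        for _ in range(max_segments):
            seg = denoise_segment(motion[-1], text_emb, scene_feat)
            motion.append(seg)
            if predict_transition(seg, goal):
                break  # scheduler triggers the next action stage
    return np.concatenate(motion[1:], axis=0)

# One stage with dummy text/scene embeddings, goal at the origin.
motion = synthesize([(np.ones(JOINT_DIM), np.zeros(JOINT_DIM))], goal=np.zeros(3))
print(motion.shape)  # (num_segments * SEG_LEN, JOINT_DIM)
```

The key structural point mirrored here is that segment generation and stage scheduling are separate modules: the diffusion model only ever produces the next local segment, while the scheduler decides when to switch conditioning to the next stage.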