Recent advances in generative modeling have led to promising progress on synthesizing 3D human motion from text, with methods that can generate character animations from short prompts and specified durations. However, using a single text prompt as input lacks the fine-grained control needed by animators, such as composing multiple actions and defining precise durations for parts of the motion. To address this, we introduce the new problem of timeline control for text-driven motion synthesis, which provides an intuitive, yet fine-grained, input interface for users. Instead of a single prompt, users can specify a multi-track timeline of multiple prompts organized in temporal intervals that may overlap. This enables specifying the exact timings of each action and composing multiple actions in sequence or at overlapping intervals. To generate composite animations from a multi-track timeline, we propose a new test-time denoising method. This method can be integrated with any pre-trained motion diffusion model to synthesize realistic motions that accurately reflect the timeline. At every step of denoising, our method processes each timeline interval (text prompt) individually, subsequently aggregating the predictions with consideration for the specific body parts engaged in each action. Experimental comparisons and ablations validate that our method produces realistic motions that respect the semantics and timing of given text prompts. Our code and models are publicly available at https://mathis.petrovich.fr/stmc.
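The per-interval denoising and body-part-aware aggregation described above can be illustrated with a minimal sketch. It is not the released implementation: the feature layout `BODY_PARTS`, the placeholder `toy_denoiser`, the tuple format of `timeline`, and the choice to average overlapping predictions are all assumptions made for illustration. In practice the denoiser would be a pre-trained motion diffusion model, and the body parts for each prompt would have to be supplied or inferred.

```python
import numpy as np

# Hypothetical feature layout: column indices of the motion vector per body part.
BODY_PARTS = {"legs": [0, 1], "torso": [2], "left_arm": [3], "right_arm": [4]}


def toy_denoiser(x_crop, prompt, step):
    """Placeholder for a pre-trained motion diffusion denoiser (assumption).

    Returns a constant prediction derived from the prompt length so that the
    stitched result is easy to inspect.
    """
    return np.full_like(x_crop, float(len(prompt)))


def timeline_denoise_step(x, timeline, step, denoiser=toy_denoiser):
    """Run one denoising step on a float array of shape (frames, features).

    timeline: list of (start, end, prompt, parts); `end` is exclusive and
    `parts` names the body parts the prompt is assumed to drive.
    """
    total = np.zeros_like(x)
    count = np.zeros_like(x)
    for start, end, prompt, parts in timeline:
        # Each interval is denoised on its own crop with its own prompt.
        pred = denoiser(x[start:end], prompt, step)
        cols = [c for p in parts for c in BODY_PARTS[p]]
        # Keep only the body parts this prompt is assumed to control.
        total[start:end, cols] += pred[:, cols]
        count[start:end, cols] += 1
    # Entries that no prompt covers fall back to an unconditional prediction.
    fallback = denoiser(x, "", step)
    # Overlapping predictions for the same part are averaged (a simplification).
    return np.where(count > 0, total / np.maximum(count, 1), fallback)


timeline = [
    (0, 6, "walk", ["legs", "torso"]),
    (4, 10, "wave hand", ["right_arm"]),
]
stitched = timeline_denoise_step(np.zeros((10, 5)), timeline, step=0)
```

In a full sampler this function would be called once per diffusion step, with the stitched prediction feeding the next step, so that the intervals are blended throughout denoising rather than pasted together afterwards.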