Recent advances in generative modeling have led to promising progress on synthesizing 3D human motion from text, with methods that can generate character animations from short prompts and specified durations. However, using a single text prompt as input lacks the fine-grained control needed by animators, such as composing multiple actions and defining precise durations for parts of the motion. To address this, we introduce the new problem of timeline control for text-driven motion synthesis, which provides an intuitive, yet fine-grained, input interface for users. Instead of a single prompt, users can specify a multi-track timeline of multiple prompts organized in temporal intervals that may overlap. This enables specifying the exact timings of each action and composing multiple actions in sequence or at overlapping intervals. To generate composite animations from a multi-track timeline, we propose a new test-time denoising method. This method can be integrated with any pre-trained motion diffusion model to synthesize realistic motions that accurately reflect the timeline. At every step of denoising, our method processes each timeline interval (text prompt) individually, subsequently aggregating the predictions with consideration for the specific body parts engaged in each action. Experimental comparisons and ablations validate that our method produces realistic motions that respect the semantics and timing of given text prompts. Our code and models are publicly available at https://mathis.petrovich.fr/stmc.
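The per-interval denoising and body-part-aware aggregation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Interval` structure, the `BODY_PARTS` feature layout, the `denoise` callable (a stand-in for any pre-trained motion diffusion model), and the choice to average overlapping predictions are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

import numpy as np

# Hypothetical layout: which feature columns belong to which body part.
BODY_PARTS: Dict[str, slice] = {
    "legs": slice(0, 4),
    "torso": slice(4, 6),
    "arms": slice(6, 10),
}


@dataclass
class Interval:
    """One timeline entry: a text prompt placed on a frame range."""
    text: str
    start: int             # first frame, inclusive
    end: int               # last frame, exclusive
    parts: Sequence[str]   # body parts assumed to be engaged by the action


# Assumed denoiser signature: (noisy crop, prompt, step) -> predicted clean crop.
Denoiser = Callable[[np.ndarray, str, int], np.ndarray]


def aggregate_step(
    x_t: np.ndarray, timeline: List[Interval], denoise: Denoiser, step: int
) -> np.ndarray:
    """Run one denoising step over a multi-track timeline.

    Each interval is denoised on its own temporal crop, then only the
    columns of its engaged body parts are written back. Where intervals
    overlap on the same frame and body part, predictions are averaged
    (a simplification chosen for this sketch).
    """
    total = np.zeros_like(x_t)
    weight = np.zeros_like(x_t)
    for iv in timeline:
        pred = denoise(x_t[iv.start:iv.end], iv.text, step)
        for part in iv.parts:
            cols = BODY_PARTS[part]
            total[iv.start:iv.end, cols] += pred[:, cols]
            weight[iv.start:iv.end, cols] += 1.0
    out = x_t.copy()  # entries no interval covers keep their current value
    covered = weight > 0
    out[covered] = total[covered] / weight[covered]
    return out
```

In a full sampler, `aggregate_step` would be called once per diffusion step, with its output fed to the scheduler update in place of a single-prompt prediction.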