Generating realistic human videos remains a challenging task, with the most effective methods currently relying on a human motion sequence as a control signal. Existing approaches often reuse motion extracted from other videos, which restricts their application to specific motion types and to globally matched scenes. We propose Move-in-2D, a novel approach to generate human motion sequences conditioned on a scene image, allowing for diverse motion that adapts to different scenes. Our approach utilizes a diffusion model that accepts both a scene image and a text prompt as inputs, producing a motion sequence tailored to the scene. To train this model, we collect a large-scale video dataset featuring single-human activities, annotating each video with the corresponding human motion as the target output. Experiments demonstrate that our method effectively predicts human motion that aligns with the scene image after projection. Furthermore, we show that the generated motion sequence improves human motion quality in video synthesis tasks.
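The abstract describes a diffusion model that takes a scene image and a text prompt as conditions and outputs a human motion sequence. The sketch below is only an illustrative approximation of that setup under stated assumptions, not the authors' implementation: the module names (MotionDenoiser, sample_motion), the pose and embedding dimensions, the GRU backbone, and the simplified reverse-diffusion update are all placeholders chosen for the example.

```python
# Minimal sketch (not the paper's code) of a motion diffusion model conditioned
# on a scene-image embedding and a text-prompt embedding. Dimensions, modules,
# and the denoising update rule are assumptions made for illustration only.
import torch
import torch.nn as nn

SEQ_LEN, POSE_DIM, COND_DIM, STEPS = 120, 66, 512, 50  # assumed sizes


class MotionDenoiser(nn.Module):
    """Predicts the noise in a motion sequence given scene/text conditions."""

    def __init__(self):
        super().__init__()
        self.cond_proj = nn.Linear(2 * COND_DIM, POSE_DIM)
        self.net = nn.GRU(POSE_DIM, POSE_DIM, batch_first=True)

    def forward(self, noisy_motion, t, scene_emb, text_emb):
        # Fuse the two condition embeddings and broadcast over every frame.
        # The timestep t would normally be embedded and injected as well;
        # it is omitted here to keep the sketch short.
        cond = self.cond_proj(torch.cat([scene_emb, text_emb], dim=-1))
        h, _ = self.net(noisy_motion + cond.unsqueeze(1))
        return h  # predicted noise, same shape as noisy_motion


@torch.no_grad()
def sample_motion(model, scene_emb, text_emb):
    """Highly simplified reverse-diffusion loop (illustrative, not a real DDPM schedule)."""
    x = torch.randn(scene_emb.shape[0], SEQ_LEN, POSE_DIM)
    for step in reversed(range(STEPS)):
        t = torch.full((x.shape[0],), step)
        eps = model(x, t, scene_emb, text_emb)
        x = x - eps / STEPS  # placeholder update
    return x  # motion sequence tailored to the scene/text conditions


if __name__ == "__main__":
    model = MotionDenoiser()
    scene = torch.randn(1, COND_DIM)  # e.g. from a frozen image encoder
    text = torch.randn(1, COND_DIM)   # e.g. from a frozen text encoder
    print(sample_motion(model, scene, text).shape)  # -> (1, 120, 66)
```

In the paper's setting, the sampled sequence would then serve as the control signal for a downstream human video synthesis model.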