Generating video backgrounds tailored to the motion of a foreground subject is an important problem for the movie industry and the visual effects community. The task involves synthesizing a background that aligns with the motion and appearance of the foreground subject while also complying with the artist's creative intention. We introduce ActAnywhere, a generative model that automates this process, which traditionally requires tedious manual effort. Our model leverages the power of large-scale video diffusion models and is specifically tailored for this task. ActAnywhere takes a sequence of foreground subject segmentations as input and an image describing the desired scene as the condition, and produces a coherent video with realistic foreground-background interactions that adheres to the condition frame. We train our model on a large-scale dataset of human-scene interaction videos. Extensive evaluations show that our model significantly outperforms baselines. Moreover, we show that ActAnywhere generalizes to diverse out-of-distribution samples, including non-human subjects. Please visit our project webpage at https://actanywhere.github.io.