The proliferation of video content demands efficient and flexible neural-network-based approaches for generating new video. In this paper, we propose a novel approach that combines zero-shot text-to-video generation with ControlNet to improve the output of these models. Our method takes multiple sketched frames as input and generates video that follows the motion implied by these frames. It builds upon the Text-to-Video Zero architecture and incorporates ControlNet to enable additional input conditions. We first interpolate frames between the input sketches and then run Text-to-Video Zero with the resulting interpolated video as the control signal, which combines the benefits of zero-shot text-to-video generation with the robust control provided by ControlNet. Experiments demonstrate that our method produces high-quality, highly consistent video that aligns more accurately with the user's intended motion for the subject. We provide a comprehensive resource package, including a demo video, a project website, an open-source GitHub repository, and a Colab playground, to foster further research on and application of the proposed method.
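To make the first stage of the pipeline concrete, the following is a minimal sketch of frame interpolation between sketched keyframes. The abstract does not state which interpolation technique is used, so a plain linear cross-fade stands in for it here; the function name `interpolate_sketches` and the parameter `steps_between` are hypothetical and chosen only for illustration. The second stage, running Text-to-Video Zero with ControlNet on the resulting frames, is not shown.

```python
import numpy as np


def interpolate_sketches(keyframes, steps_between):
    """Linearly cross-fade between consecutive sketch keyframes.

    Assumption: a pixel-wise linear blend stands in for whatever
    interpolation the actual method uses.

    keyframes: list of uint8 arrays with identical shape, (H, W) or (H, W, C).
    steps_between: number of in-between frames inserted between each pair.
    Returns a uint8 array of shape (num_frames, ...) ordered in time.
    """
    if len(keyframes) < 2:
        raise ValueError("need at least two keyframes")
    if steps_between < 0:
        raise ValueError("steps_between must be non-negative")

    frames = []
    for start, end in zip(keyframes[:-1], keyframes[1:]):
        start = start.astype(np.float32)
        end = end.astype(np.float32)
        # t covers [0, 1) so the end keyframe of one segment is not
        # duplicated as the start keyframe of the next segment.
        for i in range(steps_between + 1):
            t = i / (steps_between + 1)
            frames.append((1.0 - t) * start + t * end)
    # Append the final keyframe, which the loop above leaves out.
    frames.append(keyframes[-1].astype(np.float32))

    stacked = np.stack(frames)
    return np.clip(np.rint(stacked), 0, 255).astype(np.uint8)
```

With two keyframes and `steps_between=3`, the function returns five frames: the two originals plus three blends at equal spacing. Under the approach described above, a stack like this would serve as the control video passed to the generation stage.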