Video diffusion models have been developed for video generation, usually integrating text and image conditioning to enhance control over the generated content. Despite this progress, ensuring consistency across frames remains a challenge, particularly when text prompts are used as control conditions. To address this problem, we introduce UniCtrl, a novel, universally applicable plug-and-play method that improves the spatiotemporal consistency and motion diversity of videos generated by text-to-video models without additional training. UniCtrl ensures semantic consistency across frames through cross-frame self-attention control, while enhancing motion quality and spatiotemporal consistency through motion injection and spatiotemporal synchronization. Our experimental results show that UniCtrl improves various text-to-video models, confirming its effectiveness and universality.
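The abstract names cross-frame self-attention control without describing it. As a rough illustration only, the sketch below assumes one common form of the idea: every frame keeps its own queries but attends to the keys and values of a single reference frame (here the first). The function names, the `ref` parameter, and the plain-list tensor layout are hypothetical and are not taken from the UniCtrl implementation; motion injection and spatiotemporal synchronization are not modeled.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention for one frame.

    q, k, v are lists of token vectors (lists of floats).
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        weights = softmax(scores)
        out.append(
            [sum(w * vj[d] for w, vj in zip(weights, v)) for d in range(len(v[0]))]
        )
    return out


def cross_frame_attention(queries, keys, values, ref=0):
    """Assumed form of cross-frame control (hypothetical helper).

    Each frame uses its own queries, but the keys and values of frame `ref`
    replace the per-frame keys and values.
    """
    return [attention(q, keys[ref], values[ref]) for q in queries]
```

Under this assumption, frames that issue the same queries receive the same attended features regardless of their own keys and values, which is one way shared keys and values could encourage semantic consistency across frames.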