Generating long and consistent videos has emerged as a significant yet challenging problem. Most existing diffusion-based video generation models are derived from image generation models and perform well on short videos, but their simple conditioning mechanisms and sampling strategies, originally designed for image generation, cause severe performance degradation when adapted to long video generation. This results in prominent temporal inconsistency and overexposure. In this work, we introduce FlexiFilm, a new diffusion model tailored for long video generation. Our framework incorporates a temporal conditioner to establish a more consistent relationship between the generation and the multi-modal conditions, and a resampling strategy to tackle overexposure. Empirical results demonstrate that FlexiFilm generates long and consistent videos, each over 30 seconds in length, and outperforms competitors in both qualitative and quantitative analyses. Project page: https://y-ichen.github.io/FlexiFilm-Page/