DiT models have achieved great success in text-to-video generation, leveraging their scalability in model capacity and data scale. High content and motion fidelity aligned with text prompts, however, often requires a large number of model parameters and a substantial number of function evaluations (NFEs). Realistic and visually appealing details are typically reflected in high-resolution outputs, which further amplifies computational demands, especially for single-stage DiT models. To address these challenges, we propose a novel two-stage framework, FlashVideo, which strategically allocates model capacity and NFEs across stages to balance generation fidelity and quality. The first stage prioritizes prompt fidelity: it uses a large model and sufficient NFEs but generates at low resolution, which keeps computation efficient. The second stage uses flow matching to establish a nearly straight ODE trajectory between low and high resolutions, generating fine details and fixing artifacts with minimal NFEs. To ensure a seamless connection between the two independently trained stages at inference, we carefully design degradation strategies for second-stage training. Quantitative and visual results demonstrate that FlashVideo achieves state-of-the-art high-resolution video generation with superior computational efficiency. Additionally, the two-stage design lets users preview the initial output and adjust the prompt accordingly before committing to full-resolution generation, which significantly reduces computational cost and wait time and enhances commercial viability.
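To illustrate why a nearly straight ODE trajectory permits very few NFEs, the following is a minimal toy sketch, not the authors' implementation. It assumes the standard linear flow-matching path x_t = (1 - t) x_0 + t x_1, whose target velocity is the constant x_1 - x_0, and it stands in for the learned second-stage network with that ideal velocity. The names `euler_integrate`, `flow_matching_pair`, `low_res`, and `high_res` are hypothetical, and the three-element vectors are placeholders for video latents.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def flow_matching_pair(x0: Vector, x1: Vector, t: float) -> Tuple[Vector, Vector]:
    """Return (x_t, target velocity) for the linear path x_t = (1 - t) * x0 + t * x1.

    Assumption: this is the generic linear-interpolation form of flow matching;
    the source does not spell out the exact training objective.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]  # constant along a straight path
    return x_t, velocity


def euler_integrate(
    velocity_fn: Callable[[Vector, float], Vector],
    x0: Vector,
    num_steps: int,
) -> Vector:
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with fixed-step Euler.

    Each step costs one function evaluation, so num_steps equals the NFE count.
    """
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical toy data: a low-resolution-stage latent and its high-resolution target.
low_res: Vector = [0.0, 1.0, -2.0]
high_res: Vector = [0.5, 1.5, -1.0]


def ideal_velocity(x: Vector, t: float) -> Vector:
    # Stand-in for a trained model that has learned the straight-path velocity.
    return [h - l for h, l in zip(high_res, low_res)]


one_step = euler_integrate(ideal_velocity, low_res, num_steps=1)
four_steps = euler_integrate(ideal_velocity, low_res, num_steps=4)
midpoint, target_v = flow_matching_pair(low_res, high_res, t=0.5)
```

Because the velocity is constant along a perfectly straight path, a single Euler step already lands on the target, and extra steps change nothing. A learned velocity field is only approximately straight, so a real sampler would still use a small number of steps rather than exactly one.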