SURF: Signature-Retained Fast Video Generation

The demand for high-resolution video generation is growing rapidly, but the achievable resolution is severely constrained by slow inference. For instance, Wan2.1 requires over 50 minutes to generate a single 720p video. Although previous works accelerate video generation from various angles, most of them compromise the distinctive signatures (e.g., layout, semantics, motion) of the original model. In this work, we propose SURF, an efficient framework for generating high-resolution videos while preserving these signatures as much as possible. Specifically, SURF divides video generation into two stages: first, we use the pretrained model to run inference at its optimal resolution and downsample the latent to quickly generate a low-resolution preview; then, we design a Refiner to upscale the preview. In the preview stage, we find that directly running a model trained at a higher resolution on a lower resolution causes a severe loss of signatures. We therefore introduce noise reshifting, a training-free technique that mitigates this issue by performing the initial denoising steps at the original resolution and switching to the low resolution in later steps. In the refine stage, we establish a mapping between the preview and the high-resolution target, which significantly reduces the number of denoising steps. We further integrate shifting windows and carefully design the training paradigm to obtain a powerful and efficient Refiner. In this way, SURF generates high-resolution videos efficiently while staying as close as possible to the signatures of the given pretrained model. SURF is conceptually simple and can serve as a plug-in that is compatible with various base models and acceleration methods. For example, it achieves a 12.5x speedup for generating 5-second, 16fps, 720p Wan2.1 videos and an 8.7x speedup for generating 5-second, 24fps, 720p HunyuanVideo videos.
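
The preview-stage schedule described above (early denoising steps at the original resolution, later steps at a lower resolution) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the SURF implementation: the function names, the toy `toy_step` update standing in for a pretrained video model, the average-pooling downsampler, and the `(C, T, H, W)` latent layout. It shows only the control flow of switching resolution partway through sampling and omits any noise-level correction the actual method may apply at the switch.

```python
import numpy as np


def downsample_latent(latent, factor):
    """Average-pool the spatial dims of a (C, T, H, W) latent (assumed downsampler)."""
    c, t, h, w = latent.shape
    if h % factor or w % factor:
        raise ValueError("spatial dims must be divisible by factor")
    blocks = latent.reshape(c, t, h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(3, 5))


def toy_step(latent, t, dt):
    """Hypothetical stand-in for one denoising step of a pretrained model.

    It simply moves the latent a fraction dt toward zero; a real model
    would predict the update from the latent and the timestep t.
    """
    return latent - dt * latent


def preview_with_resolution_switch(latent, num_steps, switch_step, factor, step_fn):
    """Run steps [0, switch_step) at full resolution, the rest at low resolution."""
    shapes = []
    dt = 1.0 / num_steps
    for i in range(num_steps):
        if i == switch_step:
            # Switch to the low-resolution latent before the later steps.
            latent = downsample_latent(latent, factor)
        t = 1.0 - i * dt
        latent = step_fn(latent, t, dt)
        shapes.append(latent.shape)
    return latent, shapes


rng = np.random.default_rng(0)
noise = rng.standard_normal((4, 3, 8, 8))  # toy (C, T, H, W) latent
preview, shapes = preview_with_resolution_switch(
    noise, num_steps=10, switch_step=3, factor=2, step_fn=toy_step
)
print(shapes[2], shapes[3], preview.shape)
```

In this toy run the first three steps operate on the 8x8 latent and the remaining seven on the 4x4 latent, so most of the per-step cost is paid at the lower resolution while the earliest steps still run at the resolution the model was trained for.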
