Video diffusion models (VDMs) enable the generation of high-quality videos, and current research predominantly concentrates on scaling at training time through improvements in data quality, computational resources, and model complexity. Inference-time scaling, however, has received far less attention, with most approaches restricting models to a single generation attempt. Recent studies have uncovered the existence of "golden noises" that can enhance video quality during generation. Building on this, we find that guiding an inference-time scaling search of VDMs toward better noise candidates not only evaluates the quality of frames generated at the current step but also preserves high-level object features by referencing anchor frames from previously generated chunks, thereby delivering long-term value. Our analysis reveals that diffusion models can flexibly adjust their computation by varying the number of denoising steps, and that even a one-step denoising approach, when guided by a reward signal, yields significant long-term benefits. Based on this observation, we propose ScalingNoise, a plug-and-play inference-time search strategy that identifies golden initial noises for the diffusion sampling process to improve global content consistency and visual diversity. Specifically, we perform one-step denoising to convert initial noises into a clip and subsequently evaluate its long-term value, leveraging a reward model anchored by previously generated content. Moreover, to preserve diversity, we sample candidates from a tilted noise distribution that up-weights promising noises. In this way, ScalingNoise significantly reduces noise-induced errors, ensuring more coherent and spatiotemporally consistent video generation. Extensive experiments on benchmark datasets demonstrate that the proposed ScalingNoise effectively improves long video generation.
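The search procedure described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function names (`one_step_denoise`, `reward`), the candidate count, the tensor shapes, and the softmax temperature are all assumptions introduced here for illustration. It shows the core loop: sample candidate initial noises, one-step denoise each into a clip, score the clips with a reward anchored to previously generated content, and resample a winner from a tilted (reward-weighted) distribution rather than taking a hard argmax, so that diversity is preserved.

```python
import numpy as np

def scaling_noise_select(one_step_denoise, reward, anchor,
                         num_candidates=8, shape=(4, 8, 8),
                         temperature=1.0, rng=None):
    """Hypothetical sketch of ScalingNoise candidate selection.

    one_step_denoise: maps an initial noise tensor to a (rough) clip
                      via a single denoising step.
    reward:           scores a clip against the anchor content from
                      previously generated chunks (higher is better).
    anchor:           reference content (e.g. an anchor frame/clip).
    """
    rng = np.random.default_rng() if rng is None else rng

    # Sample candidate initial noises from the standard Gaussian prior.
    noises = [rng.standard_normal(shape) for _ in range(num_candidates)]

    # One-step denoising: a cheap preview of what each noise would yield.
    clips = [one_step_denoise(z) for z in noises]

    # Score each preview clip with the anchor-conditioned reward model.
    scores = np.array([reward(clip, anchor) for clip in clips])

    # Tilted distribution: up-weight promising noises via a softmax over
    # rewards instead of a hard argmax, preserving diversity.
    probs = np.exp((scores - scores.max()) / temperature)
    probs /= probs.sum()
    idx = rng.choice(num_candidates, p=probs)
    return noises[idx], scores[idx]

# Toy usage with stand-in denoiser and reward (for illustration only):
# the "denoiser" halves the noise, and the reward favors clips close
# to an all-zeros anchor.
toy_denoise = lambda z: 0.5 * z
toy_reward = lambda clip, anchor: -np.abs(clip - anchor).mean()
noise, score = scaling_noise_select(toy_denoise, toy_reward,
                                    anchor=np.zeros((4, 8, 8)),
                                    rng=np.random.default_rng(0))
```

In a real pipeline the selected noise would seed the full multi-step sampling for the next chunk, and the resulting clip's anchor frame would condition the reward for the chunk after it.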