Adapting pretrained image-based diffusion models to generate temporally consistent videos has become an impactful research direction in generative modeling. Training-free noise-space manipulation has proven effective; the challenge is to introduce temporal consistency while preserving the Gaussian white-noise distribution. Recently, Chang et al. (2024) formulated this problem using an integral noise representation with distribution-preserving guarantees, and proposed an upsampling-based algorithm to compute it. However, while their mathematical formulation is advantageous, the algorithm incurs a high computational cost. By analyzing the limiting behavior of their algorithm as the upsampling resolution tends to infinity, we develop an alternative algorithm that gathers increments of multiple Brownian bridges, matching their infinite-resolution accuracy while reducing the computational cost by orders of magnitude. We prove and experimentally validate our theoretical claims, and demonstrate our method's effectiveness in real-world applications. We further show that our method extends readily to three dimensions.
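The Brownian-bridge mechanism alluded to above can be illustrated with a minimal, self-contained sketch (the function name and setup are illustrative assumptions, not the paper's actual algorithm): conditioning ordinary Gaussian increments on their sum produces bridge increments that reassemble a given noise value exactly, while remaining Gaussian — the distribution-preserving property at the heart of such noise-space manipulations.

```python
import numpy as np

rng = np.random.default_rng(0)

def bridge_increments(w, t, rng):
    """Sample increments of a Brownian bridge over sub-intervals of
    lengths t (summing to T), conditioned to sum to the value w.

    Conditional law: d = g - (t / T) * (g.sum() - w), where the g_i
    are unconditioned Brownian increments, g_i ~ N(0, t_i).
    """
    t = np.asarray(t, dtype=float)
    T = t.sum()
    g = rng.normal(0.0, np.sqrt(t))      # free Brownian increments
    return g - (t / T) * (g.sum() - w)   # condition on the endpoint

# A single noise value w ~ N(0, 1) split into 4 bridge increments:
w = rng.normal()
d = bridge_increments(w, [0.25, 0.25, 0.25, 0.25], rng)
print(np.isclose(d.sum(), w))  # the increments reassemble w exactly
```

Summing the conditioned increments recovers `w` by construction, so repartitioning noise this way never disturbs the marginal Gaussian distribution of the reassembled value.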