Coarse-guided visual generation, which synthesizes fine visual samples from degraded or low-fidelity coarse references, is essential for various real-world applications. While training-based approaches are effective, they are inherently limited by high training costs and by the restricted generalization that follows from paired data collection. Accordingly, recent training-free works propose to leverage pretrained diffusion models and incorporate guidance during the sampling process. However, these training-free methods either require knowledge of the forward (fine-to-coarse) transformation operator, e.g., bicubic downsampling, or struggle to balance guidance adherence against synthesis quality. To address these challenges, we propose a novel guidance method based on the h-transform, a tool for constraining stochastic processes (e.g., the sampling process) to satisfy desired conditions. Specifically, we modify the transition probability at each sampling timestep by adding a drift term to the original differential equation, approximately steering the generation toward the ideal fine sample. To address the unavoidable approximation errors, we introduce a noise-level-aware schedule that gradually down-weights the drift term as the error grows, ensuring both guidance adherence and high-quality synthesis. Extensive experiments across diverse image and video generation tasks demonstrate the effectiveness and generalization of our method.
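To make the described sampling procedure concrete, the following is a minimal sketch of one guided reverse-diffusion step, assuming a Karras-style variance-exploding parameterization in which the probability-flow ODE drift is $-\sigma_t \nabla_x \log p_{\sigma_t}(x)$. The names `score_model`, `degrade`, and the specific form of the weight schedule `w_t` are hypothetical illustrations, not the paper's actual implementation; the h-transform drift is approximated here with a common measurement-consistency gradient through the Tweedie denoised estimate.

```python
import torch

def guided_sampling_step(x_t, sigma_t, sigma_next, score_model, degrade, y,
                         base_weight=1.0):
    """One Euler step of the reverse process with an approximate h-transform drift.

    x_t:         current noisy sample
    sigma_t:     noise level at the current step
    sigma_next:  noise level at the next (less noisy) step
    score_model: (x, sigma) -> estimated score grad_x log p_sigma(x)
    degrade:     fine-to-coarse operator A(.) (known or a surrogate)
    y:           coarse reference observation
    """
    x_t = x_t.detach().requires_grad_(True)

    # Unconditional score from the pretrained diffusion model.
    score = score_model(x_t, sigma_t)

    # Tweedie estimate of the clean sample: x0_hat = x_t + sigma_t^2 * score.
    x0_hat = x_t + sigma_t ** 2 * score

    # Approximate h-transform drift: grad_x log p(y | x_t) is approximated
    # (up to scale) by the negative gradient of a data-consistency loss.
    residual = (y - degrade(x0_hat)).pow(2).sum()
    h_drift = -torch.autograd.grad(residual, x_t)[0]

    # Noise-level-aware schedule (hypothetical form): down-weight the drift
    # at high noise levels, where the Tweedie approximation error is large.
    w_t = base_weight / (1.0 + sigma_t ** 2)

    # Euler step of the probability-flow ODE with the added guidance drift.
    d = -sigma_t * (score + w_t * h_drift)
    x_next = x_t + (sigma_next - sigma_t) * d
    return x_next.detach()
```

In this sketch the schedule `w_t` plays the role of the noise-level-aware de-weighting described above: as `sigma_t` grows (and with it the approximation error of the drift), the guidance term contributes less and the update falls back toward the unconditional pretrained dynamics.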