Pretrained image diffusion models have been used not only to generate fixed-size images but also to create panoramas. However, naively stitching multiple images often produces visible seams. Recent techniques address this issue by running joint diffusions in multiple windows and averaging latent features in the overlapping regions. These approaches, which focus on seamless montage generation, often yield incoherent outputs that blend different scenes within a single image. To overcome this limitation, we propose SyncDiffusion, a plug-and-play module that synchronizes multiple diffusions through gradient descent on a perceptual similarity loss. Specifically, at each denoising step we compute the gradient of the perceptual loss using the predicted denoised images, which provides meaningful guidance toward coherent montages. Our experimental results show that our method produces significantly more coherent outputs than previous methods (66.35% vs. 33.65% in our user study) while maintaining fidelity (as assessed by GIQA) and compatibility with the input prompt (as measured by CLIP score). We further demonstrate the versatility of our method in three plug-and-play applications: layout-guided image generation, conditional image generation, and 360-degree panorama generation. Our project page is at https://syncdiffusion.github.io.
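The synchronization idea described above (a gradient step on a similarity loss computed from predicted denoised images, followed by averaging in overlapping regions) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the authors' implementation: the latent is a 1-D list of floats, `predict_x0` uses a placeholder denoiser that predicts zero noise, the "perceptual" feature is simply the mean intensity of a window, the first window is arbitrarily chosen as the reference, and the actual denoising update is omitted. The names `window_starts`, `predict_x0`, `feature`, `coherence_loss`, and `sync_step` are hypothetical.

```python
def window_starts(width, win, stride):
    """Left edges of sliding windows that tile a 1-D latent of length `width`."""
    if win > width or (width - win) % stride != 0:
        raise ValueError("windows must tile the latent exactly")
    return list(range(0, width - win + 1, stride))


def predict_x0(x_t, alpha_bar):
    """Predicted denoised signal, x0 = (x_t - sqrt(1 - a) * eps) / sqrt(a).

    ASSUMPTION: the noise estimate eps is zero (placeholder for a real model).
    """
    return [v / alpha_bar ** 0.5 for v in x_t]


def feature(x0):
    """Toy stand-in for a perceptual feature: the mean intensity."""
    return sum(x0) / len(x0)


def coherence_loss(latent, win, stride, alpha_bar):
    """Sum of squared feature gaps between each window and the first window."""
    starts = window_starts(len(latent), win, stride)
    ref = feature(predict_x0(latent[starts[0]:starts[0] + win], alpha_bar))
    return sum(
        (feature(predict_x0(latent[s:s + win], alpha_bar)) - ref) ** 2
        for s in starts
    )


def sync_step(latent, win, stride, alpha_bar, weight):
    """One synchronization step: gradient descent per window, then averaging."""
    starts = window_starts(len(latent), win, stride)
    # ASSUMPTION: the first window serves as the reference for all others.
    ref = feature(predict_x0(latent[starts[0]:starts[0] + win], alpha_bar))
    total = [0.0] * len(latent)
    count = [0] * len(latent)
    for s in starts:
        x_t = latent[s:s + win]
        diff = feature(predict_x0(x_t, alpha_bar)) - ref
        # Analytic gradient of diff**2 w.r.t. each entry of x_t:
        # d(mean(x_t / sqrt(a)))/dx_t[k] = 1 / (win * sqrt(a)).
        grad = 2.0 * diff / (win * alpha_bar ** 0.5)
        x_t = [v - weight * grad for v in x_t]
        # A real sampler would now run the denoising update on x_t (omitted).
        for k, v in enumerate(x_t):
            total[s + k] += v
            count[s + k] += 1
    # Average the windows where they overlap to avoid seams.
    return [t / c for t, c in zip(total, count)]


# Two "scenes" side by side: a dark half and a bright half.
latent = [0.0] * 8 + [1.0] * 8
before = coherence_loss(latent, win=8, stride=4, alpha_bar=0.5)
synced = sync_step(latent, win=8, stride=4, alpha_bar=0.5, weight=1.0)
after = coherence_loss(synced, win=8, stride=4, alpha_bar=0.5)
print(before, after)
```

In this toy setting the loss after the step is lower than before, because each window is nudged toward the reference window's feature before the overlaps are averaged. With a real model, the gradient would typically be obtained by automatic differentiation through the denoiser's prediction and a learned perceptual metric rather than in closed form.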