Diffusion models have demonstrated remarkable performance in text-to-image synthesis, producing realistic, high-resolution images that faithfully adhere to their text prompts. Despite this success, they still fall behind in sketch-to-image synthesis, where, in addition to following the text prompt, the spatial layout of the generated image has to closely follow the outlines of a reference sketch. A recently proposed approach employs an MLP latent edge predictor that guides the spatial layout of the synthesized image by predicting an edge map at each denoising step. Although it yields promising results, the MLP operates pixel-wise and therefore does not take the spatial layout as a whole into account; it also requires numerous denoising iterations to produce satisfactory images, which makes it slow. To address these limitations, we introduce U-Sketch, a framework featuring a U-Net-type latent edge predictor that efficiently captures both local and global features, as well as spatial correlations between pixels. Moreover, we add a sketch simplification network that gives the user the option of preprocessing and simplifying input sketches for enhanced outputs. The experimental results, corroborated by user feedback, demonstrate that the proposed U-Net latent edge predictor produces more realistic results that are better aligned with the spatial outlines of the reference sketches, while drastically reducing the number of required denoising steps and, consequently, the overall execution time.
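The contrast between a pixel-wise predictor and a convolutional one can be illustrated with a toy sketch. This is not the U-Sketch architecture or the MLP predictor from prior work: the grid size, the single-layer `pixelwise_predict`, the single 3x3 `conv_predict`, and the helper `changed_pixels` are all hypothetical names and simplifications introduced here. The sketch only shows that perturbing one input location affects a single output pixel under a per-pixel model, while a convolution (the basic building block of a U-Net) spreads the effect to neighbouring pixels.

```python
import math
import random

random.seed(0)
H, W, C = 5, 5, 4  # toy feature grid: height, width, channels (illustrative sizes)

# Random stand-in for per-pixel latent features (assumption: real features come from the denoiser).
feats = [[[random.uniform(-1, 1) for _ in range(C)] for _ in range(W)] for _ in range(H)]

# Hypothetical per-pixel predictor: one linear layer + tanh, applied independently at each pixel.
mlp_w = [random.uniform(-1, 1) for _ in range(C)]

def pixelwise_predict(f):
    return [[math.tanh(sum(w * x for w, x in zip(mlp_w, f[i][j]))) for j in range(W)]
            for i in range(H)]

# Hypothetical single 3x3 convolution with zero padding, one output channel.
conv_w = [[[random.uniform(-1, 1) for _ in range(C)] for _ in range(3)] for _ in range(3)]

def conv_predict(f):
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < H and 0 <= jj < W:  # skip positions outside the grid
                        acc += sum(w * x for w, x in zip(conv_w[di + 1][dj + 1], f[ii][jj]))
            out[i][j] = acc
    return out

def changed_pixels(predict, f, i, j):
    """Output positions whose value changes when input pixel (i, j) is perturbed."""
    base = predict(f)
    perturbed = [[list(px) for px in row] for row in f]
    perturbed[i][j] = [x + 1.0 for x in perturbed[i][j]]
    new = predict(perturbed)
    return {(a, b) for a in range(H) for b in range(W) if abs(new[a][b] - base[a][b]) > 1e-12}

mlp_changed = changed_pixels(pixelwise_predict, feats, 2, 2)
conv_changed = changed_pixels(conv_predict, feats, 2, 2)
print("pixel-wise predictor, affected outputs:", sorted(mlp_changed))
print("3x3 convolution, affected outputs:", sorted(conv_changed))
```

A U-Net stacks many such convolutions with downsampling and skip connections, so its receptive field grows well beyond the 3x3 neighbourhood shown here; the toy example only captures the first step of that difference.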