Diffusion-based generative models excel at perceptually impressive synthesis but remain difficult to interpret. This paper introduces ToddlerDiffusion, an interpretable 2D diffusion framework for image synthesis inspired by how humans create images. Unlike traditional diffusion models with opaque denoising steps, our approach decomposes the generation process into simpler, interpretable stages: generating contours, then a palette, and finally a detailed colored image. This decomposition not only enhances overall performance but also enables robust editing and interaction capabilities. Each stage is formulated for efficiency and accuracy, and the resulting framework surpasses Stable-Diffusion (LDM). Extensive experiments on datasets such as LSUN-Churches and COCO validate our approach, which consistently outperforms existing methods. ToddlerDiffusion is also notably efficient: it matches LDM performance on LSUN-Churches while running three times faster with an architecture 3.76 times smaller. Our source code is provided in the supplementary material and will be made publicly available.