Diffusion models excel at generating photo-realistic images but incur significant computational costs in both training and sampling. While various techniques address these costs, a less-explored issue is the design of an efficient and adaptable network backbone for iterative refinement. Current options such as U-Net and Vision Transformer often rely on resource-intensive deep networks and lack the flexibility needed to generate images at variable resolutions or to sample with a smaller network than the one used in training. This study introduces LEGO bricks, which seamlessly integrate Local-feature Enrichment and Global-content Orchestration. These bricks can be stacked to create a test-time reconfigurable diffusion backbone, in which bricks can be selectively skipped to reduce sampling costs and images can be generated at resolutions higher than that of the training data. Each LEGO brick enriches local regions with an MLP and transforms them using a Transformer block while maintaining a consistent full-resolution image across all bricks. Experimental results demonstrate that LEGO bricks enhance training efficiency, expedite convergence, and facilitate variable-resolution image generation while maintaining strong generative performance. Moreover, LEGO significantly reduces sampling time compared to other methods, establishing it as a valuable enhancement for diffusion models.
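The stack-and-skip idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class `ToyBrick`, the helper `run_stack`, the patch sizes, the random untrained weights, and the projection-free single-head attention are all assumptions chosen for brevity, and timestep conditioning is omitted entirely. The sketch only shows the structural property the abstract names: each brick enriches local regions, mixes them globally, and returns an image at the same full resolution as its input, so bricks can be skipped or applied to a different resolution at test time.

```python
import math
import random


def split_patches(image, p):
    """Split an H x W image (list of rows) into flattened, non-overlapping p x p patches."""
    h, w = len(image), len(image[0])
    return [
        [image[r + i][c + j] for i in range(p) for j in range(p)]
        for r in range(0, h, p)
        for c in range(0, w, p)
    ]


def merge_patches(patches, h, w, p):
    """Inverse of split_patches: reassemble the full-resolution image."""
    image = [[0.0] * w for _ in range(h)]
    idx = 0
    for r in range(0, h, p):
        for c in range(0, w, p):
            for i in range(p):
                for j in range(p):
                    image[r + i][c + j] = patches[idx][i * p + j]
            idx += 1
    return image


class ToyBrick:
    """Hypothetical stand-in for one brick: local enrichment, then global mixing."""

    def __init__(self, patch_size, seed):
        rng = random.Random(seed)
        d = patch_size * patch_size
        self.p = patch_size
        # Assumption: a single random dense layer stands in for the local MLP.
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(d)]

    def local_enrich(self, token):
        # Per-patch dense layer with tanh and a residual connection.
        return [
            t + math.tanh(sum(wij * tj for wij, tj in zip(row, token)))
            for t, row in zip(token, self.w)
        ]

    def global_mix(self, tokens):
        # Assumption: softmax self-attention without learned projections
        # stands in for the Transformer block.
        d = len(tokens[0])
        out = []
        for q in tokens:
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
            m = max(scores)
            weights = [math.exp(s - m) for s in scores]
            z = sum(weights)
            mixed = [sum(wt * k[i] for wt, k in zip(weights, tokens)) / z for i in range(d)]
            out.append([qi + mi for qi, mi in zip(q, mixed)])
        return out

    def __call__(self, image):
        h, w = len(image), len(image[0])
        tokens = [self.local_enrich(t) for t in split_patches(image, self.p)]
        return merge_patches(self.global_mix(tokens), h, w, self.p)


def run_stack(bricks, image, skip=()):
    """Apply bricks in order, skipping the indices listed in `skip`."""
    for i, brick in enumerate(bricks):
        if i not in skip:
            image = brick(image)
    return image


# Usage: three bricks with illustrative patch sizes 2, 4, and 8.
bricks = [ToyBrick(2, seed=0), ToyBrick(4, seed=1), ToyBrick(8, seed=2)]
small = [[float((r + c) % 3) for c in range(8)] for r in range(8)]
large = [[0.5] * 16 for _ in range(16)]

full_pass = run_stack(bricks, small)            # all bricks
cheap_pass = run_stack(bricks, small, skip={1})  # skip the middle brick
bigger = run_stack(bricks, large)               # same bricks, larger image
```

Because every brick maps a full-resolution image to a full-resolution image, skipping a brick or feeding a larger input needs no change to the remaining bricks. In this toy version the only constraint is that the image side lengths are divisible by each brick's patch size.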