Diffusion models excel at generating photorealistic images but incur significant computational costs in both training and sampling. While various techniques address these costs, a less-explored issue is the design of an efficient and adaptable network backbone for iterative refinement. Current options such as U-Net and Vision Transformer often rely on resource-intensive deep networks and lack the flexibility needed to generate images at variable resolutions or with a smaller network than the one used in training. This study introduces LEGO bricks, which seamlessly integrate Local-feature Enrichment and Global-content Orchestration. These bricks can be stacked to create a test-time reconfigurable diffusion backbone, in which bricks can be selectively skipped to reduce sampling costs and images can be generated at resolutions higher than that of the training data. Each LEGO brick enriches local regions with an MLP and transforms them using a Transformer block, while a consistent full-resolution image is maintained across all bricks. Experimental results demonstrate that LEGO bricks enhance training efficiency, expedite convergence, and facilitate variable-resolution image generation while maintaining strong generative performance. Moreover, LEGO significantly reduces sampling time compared to other methods, establishing it as a valuable enhancement for diffusion models.
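The stacking and skipping behaviour described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `ToyBrick`, `run_stack`, the patch sizes, the hidden width, the single-head attention without positional embeddings, and the random untrained weights are all assumptions chosen only to show the data flow. Each hypothetical brick splits the full-resolution image into patches, applies a per-patch MLP (local enrichment), mixes the patch tokens with self-attention (global orchestration), and writes a residual back to the full-resolution image.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each token over its feature dimension (no learned scale/shift)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


class ToyBrick:
    """Hypothetical brick: per-patch MLP followed by self-attention across patches."""

    def __init__(self, patch, channels, dim, rng, scale=0.02):
        self.patch = patch
        d_in = patch * patch * channels
        # Random, untrained weights; a real model would learn these.
        self.w1 = rng.normal(0.0, scale, (d_in, dim))
        self.w2 = rng.normal(0.0, scale, (dim, dim))
        self.wq = rng.normal(0.0, scale, (dim, dim))
        self.wk = rng.normal(0.0, scale, (dim, dim))
        self.wv = rng.normal(0.0, scale, (dim, dim))
        self.w_out = rng.normal(0.0, scale, (dim, d_in))

    def patchify(self, img):
        """(H, W, C) image -> (num_patches, patch*patch*C) tokens."""
        h, w, c = img.shape
        p = self.patch
        x = img.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
        return x.reshape((h // p) * (w // p), p * p * c)

    def unpatchify(self, tokens, shape):
        """Inverse of patchify: tokens -> (H, W, C) image."""
        h, w, c = shape
        p = self.patch
        x = tokens.reshape(h // p, w // p, p, p, c).transpose(0, 2, 1, 3, 4)
        return x.reshape(h, w, c)

    def __call__(self, img):
        tokens = self.patchify(img)
        # Local enrichment: the same two-layer MLP is applied to every patch.
        hidden = np.maximum(tokens @ self.w1, 0.0) @ self.w2
        # Global orchestration: single-head self-attention over all patch tokens.
        z = layer_norm(hidden)
        q, k, v = z @ self.wq, z @ self.wk, z @ self.wv
        scores = q @ k.T / np.sqrt(q.shape[-1])
        scores = scores - scores.max(axis=-1, keepdims=True)
        attn = np.exp(scores)
        attn = attn / attn.sum(axis=-1, keepdims=True)
        hidden = hidden + attn @ v
        # Every brick reads and writes a full-resolution image (residual update).
        return img + self.unpatchify(hidden @ self.w_out, img.shape)


def run_stack(bricks, img, active=None):
    """Apply bricks in order; entries of `active` set to False are skipped."""
    for i, brick in enumerate(bricks):
        if active is not None and not active[i]:
            continue
        img = brick(img)
    return img


rng = np.random.default_rng(0)
bricks = [ToyBrick(patch=p, channels=3, dim=32, rng=rng) for p in (4, 8, 16)]

x32 = rng.normal(size=(32, 32, 3))
full = run_stack(bricks, x32)                               # all three bricks
cheap = run_stack(bricks, x32, active=[True, False, True])  # middle brick skipped

x64 = rng.normal(size=(64, 64, 3))
big = run_stack(bricks, x64)                                # same weights, larger image
```

Because every brick in this sketch maps a full-resolution image to a full-resolution image, dropping a brick leaves the interface between the remaining bricks unchanged, and since the weights act per patch, the same stack accepts a larger input. The sketch demonstrates only that the shapes compose; it says nothing about the image quality a trained model would achieve when bricks are skipped or the resolution is changed.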