In this paper, we present DiT-MoE, a sparse version of the diffusion Transformer that is scalable and competitive with dense networks while exhibiting highly optimized inference. DiT-MoE incorporates two simple designs: shared expert routing and an expert-level balance loss, which together capture common knowledge and reduce redundancy among the routed experts. When applied to conditional image generation, a deep analysis of expert specialization yields several interesting observations: (i) expert selection shows a preference for spatial position and denoising time step, while being insensitive to class-conditional information; (ii) as the MoE layers go deeper, expert selection gradually shifts from specific spatial positions toward dispersion and balance; (iii) expert specialization tends to be concentrated at early time steps and becomes gradually uniform after the halfway point. We attribute this to the diffusion process, which first models low-frequency spatial information and then high-frequency complex information. Guided by these observations, a series of DiT-MoE models experimentally achieves performance on par with dense networks while requiring far less computation during inference. More encouragingly, we demonstrate the potential of DiT-MoE on synthesized image data, scaling the diffusion model to 16.5B parameters and attaining a new SoTA FID-50K score of 1.80 at 512$\times$512 resolution. The project page: https://github.com/feizc/DiT-MoE.
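To make the two design choices concrete, the following is a minimal, self-contained sketch of MoE-style top-$k$ routing with an expert-level balance loss of the Switch-Transformer form ($N \cdot \sum_i f_i P_i$, where $f_i$ is the fraction of tokens routed to expert $i$ and $P_i$ the mean router probability). All names, shapes, and the toy numbers are illustrative assumptions, not the paper's actual implementation; a shared expert would simply process every token in addition to the routed ones.

```python
# Hedged sketch (assumed, not the authors' code): top-k expert routing
# plus an expert-level load-balance loss, in pure Python for clarity.
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def moe_route(token_logits, k=2):
    """Return the indices of the top-k routed experts and the router probs."""
    probs = softmax(token_logits)
    topk = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    return topk, probs

def balance_loss(all_probs, all_topk, num_experts, k):
    """Expert-level balance loss: N * sum_i f_i * P_i, minimized when
    token assignments (f_i) and router probabilities (P_i) are uniform."""
    n = len(all_probs)
    f = [0.0] * num_experts  # fraction of token slots assigned to expert i
    P = [0.0] * num_experts  # mean router probability for expert i
    for probs, topk in zip(all_probs, all_topk):
        for i in topk:
            f[i] += 1.0 / (n * k)
        for i, p in enumerate(probs):
            P[i] += p / n
    return num_experts * sum(fi * pi for fi, pi in zip(f, P))

# Toy usage: 3 tokens routed among 4 experts with top-2 selection.
logits = [[2.0, 1.0, 0.1, 0.0],
          [0.0, 2.0, 1.0, 0.1],
          [1.0, 0.1, 2.0, 0.0]]
routes = [moe_route(l, k=2) for l in logits]
all_topk = [t for t, _ in routes]
all_probs = [p for _, p in routes]
loss = balance_loss(all_probs, all_topk, num_experts=4, k=2)
# A shared expert, as in DiT-MoE, would run on every token regardless
# of `routes`, capturing knowledge common to all routed experts.
```

In practice the balance loss is added to the training objective with a small weight, discouraging the router from collapsing onto a few experts.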