Mixture-of-Experts (MoE) has emerged as a powerful paradigm for scaling model capacity while preserving computational efficiency. Despite its notable success in large language models (LLMs), existing attempts to apply MoE to Diffusion Transformers (DiTs) have yielded limited gains. We attribute this gap to fundamental differences between language and visual tokens: language tokens are semantically dense with pronounced inter-token variation, whereas visual tokens exhibit spatial redundancy and functional heterogeneity, which hinders expert specialization in vision MoE. To address this, we present ProMoE, an MoE framework featuring a two-step router with explicit routing guidance that promotes expert specialization. Specifically, this guidance encourages the router to partition image tokens into conditional and unconditional sets via conditional routing according to their functional roles, and to refine the assignments of conditional image tokens through prototypical routing with learnable prototypes according to their semantic content. Moreover, the similarity-based expert allocation in latent space enabled by prototypical routing offers a natural mechanism for incorporating explicit semantic guidance, and we validate that such guidance is crucial for vision MoE. Building on this, we propose a routing contrastive loss that explicitly enhances the prototypical routing process, promoting intra-expert coherence and inter-expert diversity. Extensive experiments on the ImageNet benchmark demonstrate that ProMoE surpasses state-of-the-art methods under both Rectified Flow and DDPM training objectives. Code and models will be made publicly available.
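To make the two-step routing described above concrete, the following is a minimal PyTorch sketch, not the authors' released implementation. The module name `ProtoRouter`, the single-linear conditional gate, the temperature `tau`, and the cross-entropy form of the routing contrastive loss are all illustrative assumptions; ProMoE's actual router, top-k policy, and loss may differ.

```python
# Hedged sketch of two-step routing with learnable prototypes.
# All names and hyperparameters here are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProtoRouter(nn.Module):
    """Step 1: a conditional gate splits image tokens into conditional vs.
    unconditional sets by functional role. Step 2: conditional tokens are
    assigned to experts by cosine similarity to learnable prototypes."""

    def __init__(self, dim: int, num_experts: int, tau: float = 0.1):
        super().__init__()
        self.cond_gate = nn.Linear(dim, 2)                              # conditional vs. unconditional
        self.prototypes = nn.Parameter(torch.randn(num_experts, dim))   # one prototype per expert
        self.tau = tau

    def forward(self, tokens: torch.Tensor):
        # tokens: (batch, seq_len, dim)
        cond_logits = self.cond_gate(tokens)                 # (B, N, 2)
        cond_mask = cond_logits.argmax(dim=-1) == 1          # True -> routed by semantic content

        # Prototypical routing: similarity-based expert allocation in latent space.
        t = F.normalize(tokens, dim=-1)
        p = F.normalize(self.prototypes, dim=-1)
        sim = t @ p.t()                                       # (B, N, num_experts)
        probs = F.softmax(sim / self.tau, dim=-1)
        expert_idx = probs.argmax(dim=-1)                     # hard assignment per token
        return cond_mask, sim, probs, expert_idx


def routing_contrastive_loss(sim: torch.Tensor, expert_idx: torch.Tensor, tau: float = 0.1):
    """InfoNCE-style stand-in: pull each token toward its assigned prototype
    (intra-expert coherence) and away from the others (inter-expert diversity)."""
    logits = sim.flatten(0, 1) / tau                          # (B*N, num_experts)
    targets = expert_idx.flatten()                            # (B*N,)
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    router = ProtoRouter(dim=64, num_experts=8)
    x = torch.randn(2, 16, 64)                                # toy image tokens
    cond_mask, sim, probs, expert_idx = router(x)
    loss = routing_contrastive_loss(sim, expert_idx)
    print(cond_mask.shape, probs.shape, loss.item())
```

In this sketch, the contrastive term operates directly on the prototype-similarity logits, which is one simple way to encourage tokens routed to the same expert to cluster around a shared prototype while keeping prototypes mutually distinct.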