Mixture-of-experts-based (MoE-based) diffusion models have demonstrated strong scalability and the ability to generate high-quality images, making them a promising choice for efficient model scaling. However, they rely on expert parallelism across GPUs, which demands efficient parallelism optimization. While state-of-the-art diffusion parallel inference methods overlap communication and computation via displaced operations, they introduce substantial staleness -- the use of outdated activations -- which is especially severe in expert-parallel scenarios and leads to significant performance degradation. We identify this staleness issue and propose DICE, a staleness-centric optimization with a three-fold approach: (1) Interweaved Parallelism, which reduces step-level staleness for free while overlapping communication and computation; (2) Selective Synchronization, which operates at the layer level and protects critical layers vulnerable to stale activations; and (3) Conditional Communication, a token-level, training-free method that dynamically adjusts communication frequency based on token importance. Together, these optimizations effectively reduce staleness, achieving up to a 1.2x speedup with minimal quality degradation. Our results establish DICE as an effective, scalable solution for large-scale MoE-based diffusion model inference.