Deploying large-scale Mixture-of-Experts (MoE) models presents challenges in memory capacity and bandwidth for expert activation. While Attention-FFN Disaggregation (AFD) has emerged as a potential architecture to decouple compute and memory resources, its performance boundaries compared to standard large-scale Expert Parallelism (EP) remain underexplored. In this paper, we conduct a systematic analysis of AFD by extending the roofline model to the communication level, correlating interconnect bandwidth, arithmetic intensity, and Hardware FLOPS Utilization (HFU). Our analysis reveals a dead zone on standard clusters: increasing the FFN instance count fails to improve HFU because the computational workload is capped by scale-out bandwidth, causing operator active time to shrink relative to the fixed latency budget. We further show that AFD's discrete node-level scaling incurs higher imbalance penalties than EP's continuous batch adjustment. Nevertheless, these limitations diminish under specific conditions: Superpod-class hardware with abundant interconnect bandwidth, and models with coarse-grained experts and lower sparsity, are more likely to benefit from AFD. These findings position AFD as a promising approach for specific hardware-model combinations rather than a universal solution.
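The dead-zone argument can be sketched numerically. In this minimal model (all numbers hypothetical, chosen only for illustration), achievable per-instance throughput under a communication-level roofline is `min(peak FLOPS, communication arithmetic intensity × scale-out bandwidth)`; spreading tokens across more FFN instances lowers the per-instance arithmetic intensity, so once the bandwidth term binds, HFU falls rather than rises:

```python
# Communication-level roofline sketch (hypothetical parameters, not measured):
# achievable FLOPS per FFN instance is capped by the lesser of accelerator
# peak and (arithmetic intensity of communication * scale-out bandwidth).

PEAK_FLOPS = 989e12   # hypothetical accelerator peak, FLOP/s
NET_BW = 50e9         # hypothetical per-instance scale-out bandwidth, B/s


def hfu(ai_comm: float) -> float:
    """HFU under the communication roofline, given communication
    arithmetic intensity in FLOPs per byte moved over the fabric."""
    achievable = min(PEAK_FLOPS, ai_comm * NET_BW)
    return achievable / PEAK_FLOPS


# Adding FFN instances spreads the same token batch thinner, so the
# per-instance communication arithmetic intensity drops roughly as 1/N.
# Beyond the point where the bandwidth term binds, extra instances stop
# improving (and then hurt) HFU -- the "dead zone".
for n_instances in (1, 2, 4, 8):
    ai = 40_000 / n_instances  # hypothetical FLOPs/byte at this sharding
    print(f"{n_instances} FFN instances: HFU = {hfu(ai):.1%}")
```

Under these illustrative numbers the first scaling step stays compute-bound (HFU flat at 100%), after which HFU halves with each doubling of instances, mirroring the abstract's claim that scale-out bandwidth, not instance count, sets the ceiling.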