Mixture-of-Experts (MoE) is a promising approach for edge AI with low-batch inference. Yet on-device deployments often face limited on-chip memory and severe workload imbalance, and the prevalent use of offloading further incurs off-chip memory access bottlenecks. Moreover, MoE sparsity and dynamic gating push distributed strategies toward much finer granularity and introduce runtime scheduling considerations. Recently, chiplet interconnects with high die-to-die (D2D) bandwidth have created new opportunities for multi-chiplet systems to address workload imbalance and offloading bottlenecks through fine-grained scheduling. In this paper, we propose Fully Sharded Expert Data Parallelism (FSE-DP), a parallelization paradigm designed specifically for low-batch MoE inference on multi-chiplet accelerators. FSE-DP attains adaptive computation-communication overlap and balanced load by orchestrating fine-grained, complementary expert streams along dynamic trajectories across high-bandwidth D2D links. The attendant dataflow complexity is tamed by a minimal, hardware-amenable set of virtualization rules and a lightweight scheduling algorithm. Our approach achieves a 1.22x to 2.00x speedup over state-of-the-art baselines and reduces on-chip memory usage by up to 78.8%.