Sparse Mixture of Experts (MoE) models, while outperforming dense Large Language Models (LLMs), face significant deployment challenges during inference due to their high memory demands. Existing offloading techniques, which swap activated and idle experts between the GPU and CPU, often suffer from rigid expert caching mechanisms: they either fail to adapt to dynamic routing, leading to inefficient cache utilization, or incur prohibitive costs for prediction training. To tackle these inference-specific challenges, we introduce ExpertFlow, a comprehensive system designed to enhance inference efficiency by accommodating flexible routing and enabling efficient expert scheduling between CPU and GPU, thereby reducing overhead and boosting system performance. Central to our approach is a predictive routing path-based offloading mechanism that employs a lightweight predictor to accurately forecast routing paths before computation begins. This proactive strategy allows for real-time error correction in expert caching, significantly increasing cache hit ratios and reducing the frequency of expert transfers, thereby minimizing I/O overhead. Additionally, we implement a dynamic token scheduling strategy that optimizes MoE inference by rearranging input tokens across different batches. This method not only reduces the number of activated experts per batch but also improves computational efficiency. Our extensive experiments demonstrate that ExpertFlow achieves up to 93.72\% GPU memory savings and accelerates inference by 2 to 10 times compared to baseline methods, highlighting its effectiveness and utility as a robust solution for resource-constrained inference scenarios.
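To make the caching idea concrete, the following is a minimal sketch of predictive expert caching with real-time error correction, assuming a lightweight predictor that guesses the set of experts the router will activate before the layer runs. The class name `ExpertCache`, the LRU eviction policy, and the method names are illustrative assumptions, not ExpertFlow's actual API.

```python
# Hypothetical sketch: an LRU cache of GPU-resident experts that is
# prefetched from a predicted routing path, then corrected on demand
# when the router's actual choices diverge from the prediction.
from collections import OrderedDict


class ExpertCache:
    """LRU cache of expert IDs resident on the GPU (illustrative)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = OrderedDict()  # expert_id -> None, in LRU order
        self.hits = 0
        self.misses = 0

    def prefetch(self, predicted_ids):
        # Proactively load predicted experts before computation begins.
        for eid in predicted_ids:
            self._load(eid)

    def access(self, actual_ids):
        # Real-time correction: mispredicted experts are fetched on demand,
        # each miss standing in for one CPU-to-GPU transfer (I/O cost).
        for eid in actual_ids:
            if eid in self.resident:
                self.hits += 1
                self.resident.move_to_end(eid)  # refresh LRU position
            else:
                self.misses += 1
                self._load(eid)

    def _load(self, eid):
        if eid in self.resident:
            self.resident.move_to_end(eid)
            return
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)  # evict least-recently used
        self.resident[eid] = None


cache = ExpertCache(capacity=4)
cache.prefetch([0, 1, 2, 3])    # predictor's guess for the next layer
cache.access([0, 1, 2, 5])      # router actually activates 5 instead of 3
print(cache.hits, cache.misses)  # 3 hits, 1 miss -> 75% cache hit ratio
```

A higher-quality predictor raises the hit ratio directly: every correctly prefetched expert replaces a blocking CPU-to-GPU transfer with a cache hit.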
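The token-scheduling idea can likewise be sketched in a few lines: regrouping tokens across batches so that tokens sharing a predicted expert set land in the same batch, which shrinks the number of distinct experts each batch activates. The routing table and the grouping-by-exact-expert-set heuristic below are simplifying assumptions for illustration, not the paper's actual scheduling algorithm.

```python
# Hypothetical sketch of dynamic token scheduling: pack tokens with the
# same predicted expert set into the same batch to cut experts per batch.
from collections import defaultdict


def schedule_tokens(token_experts, batch_size):
    """Group tokens by predicted expert set, then slice groups into batches."""
    groups = defaultdict(list)
    for token_id, experts in token_experts.items():
        groups[frozenset(experts)].append(token_id)
    batches = []
    # Deterministic group order for reproducibility of the example.
    for _, tokens in sorted(groups.items(), key=lambda kv: sorted(kv[0])):
        for i in range(0, len(tokens), batch_size):
            batches.append(tokens[i:i + batch_size])
    return batches


# Predicted top-2 expert assignments for 6 tokens (made-up routing table).
routing = {0: {1, 2}, 1: {3, 4}, 2: {1, 2}, 3: {3, 4}, 4: {1, 2}, 5: {3, 4}}

# Arrival order [[0, 1, 2], [3, 4, 5]] activates 4 experts per batch;
# the packed schedule activates only 2 experts per batch.
packed = schedule_tokens(routing, batch_size=3)
print(packed)  # [[0, 2, 4], [1, 3, 5]]
```

With fewer distinct experts per batch, fewer expert weights must be resident (or transferred) for any single forward step, which is the source of both the memory savings and the speedup the abstract reports.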