Despite the computational efficiency of Mixture-of-Experts (MoE) models, the large memory footprint and I/O overhead inherent in multi-expert architectures make real-time inference on resource-constrained edge platforms challenging. Existing static methods are bound to a rigid latency-accuracy trade-off; we observe, however, that expert importance is highly skewed and that sensitivity is depth-dependent. Motivated by these observations, we propose DyMoE, a dynamic mixed-precision quantization framework designed for high-performance edge inference. DyMoE introduces: (1) importance-aware prioritization to dynamically quantize experts at runtime; (2) depth-adaptive scheduling to preserve semantic integrity in critical layers; and (3) look-ahead prefetching to overlap I/O stalls with computation. Experimental results on commercial edge hardware show that DyMoE reduces Time-to-First-Token (TTFT) by 3.44x-22.7x and achieves up to a 14.58x speedup in Time-Per-Output-Token (TPOT) compared to state-of-the-art offloading baselines, enabling real-time, accuracy-preserving MoE inference on resource-constrained edge devices.
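To make the first two components more concrete, the following is a minimal sketch of how importance-aware prioritization and a per-layer precision budget could interact. It is not the DyMoE implementation: the function name `assign_precisions`, the use of per-expert importance scores (for example, averaged router gate weights), the per-layer budget `high_count`, and the 8-bit/4-bit widths are all assumptions introduced here for illustration.

```python
from typing import Dict, List


def assign_precisions(
    importance: List[List[float]],
    high_count: List[int],
    high_bits: int = 8,   # assumed bit-width for important experts
    low_bits: int = 4,    # assumed bit-width for the remaining experts
) -> List[Dict[int, int]]:
    """Return, for each layer, a map from expert index to bit-width.

    importance[l][e] is a hypothetical importance score of expert e in layer l.
    high_count[l] is a hypothetical per-layer budget: how many experts in
    layer l keep the higher precision (larger for more sensitive layers).
    """
    if len(importance) != len(high_count):
        raise ValueError("importance and high_count must have one entry per layer")

    plan: List[Dict[int, int]] = []
    for scores, budget in zip(importance, high_count):
        # Rank experts from most to least important within this layer.
        ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
        # Clamp the budget to the valid range, then keep the top experts.
        budget = max(0, min(budget, len(scores)))
        keep_high = set(ranked[:budget])
        plan.append(
            {e: (high_bits if e in keep_high else low_bits) for e in range(len(scores))}
        )
    return plan


if __name__ == "__main__":
    # Two layers with four experts each; the scores are made-up example values.
    scores = [[0.70, 0.10, 0.15, 0.05], [0.40, 0.30, 0.20, 0.10]]
    # Assume the second layer is more sensitive and gets a larger budget.
    for layer, mapping in enumerate(assign_precisions(scores, high_count=[1, 3])):
        print(layer, mapping)
```

Because importance is skewed, a small budget covers most of the routing mass in a layer, while the per-layer budget lets more sensitive depths retain more high-precision experts. Which layers count as critical, and how the budget is chosen at runtime, is not specified in the text above and is left to the caller in this sketch.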