Mixture-of-Experts (MoE) has become a practical architecture for scaling LLM capacity while keeping per-token compute modest, but deploying MoE models on a single, memory-limited GPU remains difficult because expert weights dominate the HBM footprint. Existing expert offloading and prefetching systems reduce the resident set, yet they often pay expert-loading costs on the critical path when expert activation becomes dense. Post-training quantization (PTQ) lowers the footprint without transfers, but prevailing pipelines fix expert bit-widths offline and assume routing remains stable, even though MoE expert utilization is heavy-tailed and the hot set can shift across workloads. We present DynaExq, a runtime-aware mixed-precision serving system that treats single-GPU MoE inference under a hard HBM envelope as an online, budget-constrained precision-allocation problem. The key insight is to keep the experts that dominate runtime traffic resident at higher precision while maintaining a low-precision fallback for the remaining experts, so the system reduces transfer volume and avoids the waiting latency that limits offloading and prefetching under dense activation. DynaExq estimates long-horizon expert hotness from router traces, selects a per-layer high-precision resident set via a budget-feasible top-$n$ rule, and applies promotions and demotions asynchronously through stable expert handles so the forward pass always executes on a fully materialized expert version. Across Qwen3-MoE-30B/80B and six benchmarks, DynaExq improves accuracy over static PTQ on Qwen3-80B (73.09% to 77.57%) under comparable device-memory budgets and achieves up to 2.73x higher throughput than offloading/prefetch baselines at batch size 32.
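The hotness estimation and budget-feasible top-$n$ selection described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the EMA decay factor, the function names (`update_hotness`, `select_resident_set`), and the per-expert byte costs are all assumptions introduced here for clarity.

```python
def update_hotness(ema, layer_counts, decay=0.9):
    """Long-horizon expert hotness as an exponential moving average of
    per-expert routing counts from router traces (decay value is illustrative)."""
    return [decay * h + (1 - decay) * c for h, c in zip(ema, layer_counts)]


def select_resident_set(hotness, hbm_budget, hi_bytes, lo_bytes, num_experts):
    """Budget-feasible top-n: keep the hottest experts at high precision.

    Assumes every expert always has a low-precision copy resident, so
    promoting one expert costs only the extra bytes (hi_bytes - lo_bytes).
    The largest n whose total promotion cost fits the remaining budget
    is feasible; the n hottest experts fill those slots.
    """
    base = num_experts * lo_bytes            # low-precision fallback for all experts
    extra = hi_bytes - lo_bytes              # marginal HBM cost of one promotion
    n = max(0, min(num_experts, (hbm_budget - base) // extra))
    ranked = sorted(range(num_experts), key=lambda e: hotness[e], reverse=True)
    return set(ranked[:n])                   # expert ids served at high precision
```

In a serving loop, `update_hotness` would run per layer as router decisions stream in, and `select_resident_set` would be re-evaluated periodically; experts entering or leaving the returned set correspond to the asynchronous promotions and demotions applied through stable expert handles.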