Diffusion large language models (dLLMs) have emerged as a competitive alternative to autoregressive (AR) models, offering better hardware utilization and bidirectional context through parallel block-level decoding. However, as dLLMs scale up with mixture-of-experts (MoE) architectures, deploying them on resource-constrained devices remains an open challenge. Existing methods designed for AR models often incur either prohibitive I/O overhead or significant compute bottlenecks. In this work, we propose TIDE, a resource-efficient inference system that exploits the temporal stability of expert activations across diffusion steps within a block. Specifically, TIDE introduces an interval-based expert refresh strategy that updates expert placement in an I/O-aware fashion. To choose the refresh interval, we formulate inference scheduling as a mathematical programming problem and solve for the interval that minimizes I/O traffic and CPU computation. Importantly, TIDE is a lossless optimization that requires no model training, providing a "free lunch" acceleration for dLLM inference. On a single GPU-CPU system, TIDE achieves up to 1.4$\times$ and 1.5$\times$ throughput improvements over prior baselines on the LLaDA2.0-mini and LLaDA2.0-flash models, respectively.
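To make the interval trade-off concrete, the following is a minimal sketch, not TIDE's actual formulation. It assumes a toy cost model in which loading an expert onto the GPU costs `io_cost`, and any activated expert that is not GPU-resident is computed on the CPU at `cpu_cost`, so outputs are unchanged. The names `simulate_cost` and `best_interval`, the activation trace, and the brute-force search (used here in place of a mathematical program) are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def simulate_cost(trace: List[Set[int]], interval: int, capacity: int,
                  io_cost: float, cpu_cost: float) -> float:
    """Toy cost of one block when GPU-resident experts are refreshed every `interval` steps."""
    resident: Set[int] = set()
    total = 0.0
    for step, active in enumerate(trace):
        if step % interval == 0:
            # Refresh: place up to `capacity` of the currently active experts on the GPU.
            target = set(sorted(active)[:capacity])
            # Only experts that are not already resident incur I/O (assumed cost model).
            total += io_cost * len(target - resident)
            resident = target
        # Active experts missing from the GPU fall back to CPU computation.
        total += cpu_cost * len(active - resident)
    return total


def best_interval(trace: List[Set[int]], capacity: int,
                  io_cost: float, cpu_cost: float) -> Tuple[int, float]:
    """Exhaustively pick the refresh interval with the lowest toy cost (smallest wins ties)."""
    costs: Dict[int, float] = {
        k: simulate_cost(trace, k, capacity, io_cost, cpu_cost)
        for k in range(1, len(trace) + 1)
    }
    k = min(costs, key=costs.get)
    return k, costs[k]


if __name__ == "__main__":
    # Hypothetical activations over four diffusion steps: stable until the last step.
    trace = [{0, 1}, {0, 1}, {0, 1}, {0, 2}]
    print(best_interval(trace, capacity=2, io_cost=3.0, cpu_cost=1.0))  # (2, 7.0)
```

In this toy trace, refreshing at every step pays for an extra expert load at the final step, while a longer interval absorbs the single miss with cheaper CPU computation. The more stable the activations, the longer the interval that such a cost model would favor.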