Diffusion Large Language Models (dLLMs) introduce iterative denoising to enable parallel token generation, but their sampling phase exhibits fundamentally different characteristics from GEMM-centric transformer layers. Profiling on modern GPUs reveals that sampling can account for up to 70% of total inference latency, primarily due to heavy memory reads and writes of vocabulary-wide logits, reduction-based token selection, and iterative masked updates. These operations demand large on-chip SRAM and involve irregular memory accesses that conventional NPUs handle inefficiently. To address this, we identify a set of critical instructions that an NPU architecture must optimize specifically for dLLM sampling. Our design employs lightweight non-GEMM vector primitives, in-place memory-reuse strategies, and a decoupled mixed-precision memory hierarchy. Together, these optimizations deliver up to a 2.53x speedup over an NVIDIA RTX A6000 GPU at an equivalent process technology node. We also open-source our cycle-accurate simulator and post-synthesis RTL verification code, confirming functional equivalence with current dLLM PyTorch implementations.
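To make the sampling bottleneck concrete, the following is a minimal NumPy sketch of one confidence-based remasking step of the kind the abstract describes: a softmax and argmax reduction over the full vocabulary at every position, followed by an in-place masked update. The function name, the top-k confidence unmasking rule, and the mask-token convention are illustrative assumptions, not the paper's exact sampler.

```python
import numpy as np

def dllm_sampling_step(logits, tokens, mask, k):
    """One illustrative dLLM denoising step (assumed scheme).

    logits: (seq_len, vocab) raw scores for the current step
    tokens: (seq_len,) current token ids; masked slots hold a placeholder
    mask:   (seq_len,) True where a position is still masked
    k:      number of positions to unmask this step
    """
    # Vocabulary-wide reductions: softmax, then max/argmax per position.
    # This is the memory-heavy part the abstract profiles on GPUs.
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    conf = probs.max(axis=-1)      # per-position confidence (reduction)
    cand = probs.argmax(axis=-1)   # reduction-based token selection

    # Only still-masked positions compete; unmask the k most confident.
    conf = np.where(mask, conf, -np.inf)
    unmask = np.argsort(-conf)[:k]

    # Iterative masked update: commit the chosen tokens in place.
    tokens = tokens.copy()
    mask = mask.copy()
    tokens[unmask] = cand[unmask]
    mask[unmask] = False
    return tokens, mask
```

Repeating this step until `mask` is empty yields the iterative denoising loop; each iteration touches the full `(seq_len, vocab)` logit tensor, which is why sampling dominates latency even though it contains no GEMMs.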