Diffusion-based LLMs (dLLMs) fundamentally depart from traditional autoregressive (AR) LLM inference: they leverage bidirectional attention, block-wise KV cache refreshing, cross-step reuse, and a non-GEMM-centric sampling phase. These characteristics make current dLLMs incompatible with most existing NPUs, as their inference patterns, in particular the reduction-heavy, top-$k$-driven sampling stage, demand ISA and memory hierarchy support beyond what AR accelerators provide. In addition, the blocked diffusion KV cache breaks from the append-only paradigm assumed by AR NPUs, and conventional AR-derived KV quantization schemes, designed for static activation distributions, do not account for the step-wise distribution shifts introduced by iterative block-wise refinement in dLLMs. In this paper, we introduce the first NPU accelerator specifically designed for dLLMs. It delivers: a dLLM-oriented ISA and compiler; a hardware-optimized execution model for both the Transformer inference and the diffusion sampling used in dLLMs; a novel Block-Adaptive Online Smoothing (BAOS) scheme for quantizing the KV cache in dLLMs; and a complete RTL implementation synthesized in a 7nm process. To evaluate and validate our design, we introduce a tri-path simulation framework comprising analytical, cycle-accurate, and accuracy simulators, together with cross-validation against physical hardware. The full NPU stack, including the ISA, simulation tools, and quantization software, will be open-sourced upon acceptance.
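To make the non-GEMM-centric sampling phase concrete, the sketch below illustrates the generic block-wise diffusion decoding loop with confidence-based top-$k$ unmasking that the abstract alludes to; it is a minimal illustration under assumed hyperparameters (`block_len`, `steps_per_block`, `mask_id`) and a generic `model` callable, not the paper's actual interface or algorithm. The `topk` reduction over per-token confidences is the reduction-heavy step that a GEMM-centric AR accelerator has no native support for.

```python
import torch

def blockwise_topk_decode(model, prompt_ids, block_len=32, num_blocks=4,
                          steps_per_block=8, mask_id=0):
    """Hypothetical sketch of block-wise diffusion decoding with
    confidence-based top-k unmasking. Names and hyperparameters are
    illustrative assumptions, not the paper's interface."""
    # Append num_blocks fully masked blocks after the prompt.
    seq = torch.cat([prompt_ids,
                     torch.full((num_blocks * block_len,), mask_id,
                                dtype=prompt_ids.dtype)])
    gen_start = prompt_ids.numel()
    for b in range(num_blocks):
        lo = gen_start + b * block_len
        hi = lo + block_len
        # Commit k tokens per refinement step so the block finishes
        # within steps_per_block iterations.
        k = max(1, block_len // steps_per_block)
        for _ in range(steps_per_block):
            logits = model(seq.unsqueeze(0))[0]     # bidirectional attention
            probs = logits.softmax(dim=-1)
            conf, pred = probs.max(dim=-1)          # per-token confidence
            masked = (seq == mask_id)
            masked[:lo] = False                     # restrict to current block
            masked[hi:] = False
            if not masked.any():
                break
            conf = conf.masked_fill(~masked, float("-inf"))
            # Reduction-heavy, top-k-driven sampling: unmask the k most
            # confident masked positions; the rest stay masked for the
            # next refinement step over the same block.
            idx = conf.topk(min(k, int(masked.sum()))).indices
            seq[idx] = pred[idx]
    return seq[gen_start:]
```

Note how, unlike AR decoding, every refinement step re-reads the current block's KV entries and may overwrite previously masked positions, which is the block-wise refresh pattern that breaks the append-only KV cache assumption of AR NPUs.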