Large language model inference proceeds in two distinct phases: a compute-bound prefill phase followed by a memory-bandwidth-bound decode phase. Hybrid Mamba-Transformer models inherit this asymmetry while adding state space model (SSM) recurrences and element-wise operations that map poorly onto matmul-centric accelerators. This mismatch creates performance bottlenecks, showing that no single homogeneous architecture can serve both phases well. We introduce DUET, a disaggregated accelerator that assigns the prefill and decode phases to specialized packages. The Prefill package uses systolic-array chiplets with off-package memory for efficient large matrix multiplications and long-sequence SSM scans. The Decode package uses vector-unit arrays with high-bandwidth in-package memory to accelerate token-by-token SSM updates and vector-matrix multiplications. Both architectures are runtime-configurable to support hybrid models that mix Mamba and attention layers. Evaluations on Nemotron-H-56B, Zamba2-7B, and Llama3-8B across four workloads show that DUET achieves 4x faster time to first token, 1.4x higher throughput, and 1.5x lower time between tokens than a B200 GPU.