This work presents a 55 nm speculative-decoding-based LLM accelerator built with bumping-based face-to-face ReRAM-on-logic stacking technology. It features a local rotation unit for outlier-free low-bit quantization, a stacking-aware processing-near-memory (PNM) architecture co-designed with blockwise vector quantization to reduce the external memory access (EMA) overhead of weights, and an adaptive parallel speculative decoding scheme with an out-of-order scheduler for high resource and bandwidth utilization. The chip achieves 14.08 to 135.69 token/s and a 4.46× to 7.17× speedup over vanilla speculative decoding.
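For readers unfamiliar with the baseline named in the abstract, vanilla speculative decoding can be sketched as a draft-then-verify loop: a small draft model proposes a few tokens, and the large target model checks them and keeps the longest agreeing prefix. The sketch below is a minimal greedy variant under stated assumptions. It is not the chip's adaptive parallel scheme or its out-of-order scheduler, and `toy_target`, `toy_draft`, and the draft length `k` are hypothetical stand-ins rather than anything described in this work.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token prefix to the next token (greedy).
Model = Callable[[List[int]], int]


def greedy_decode(target: Model, prompt: List[int], num_new: int) -> List[int]:
    """Reference: plain autoregressive decoding, one target call per token."""
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       num_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy draft-then-verify decoding; returns (tokens, verification passes)."""
    tokens = list(prompt)
    goal = len(prompt) + num_new
    passes = 0
    while len(tokens) < goal:
        # 1) The draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # 2) The target model checks every proposed position. On real hardware
        #    this is one batched pass; here it is counted as a single pass.
        passes += 1
        accepted: List[int] = []
        for guess in proposal:
            truth = target(tokens + accepted)
            accepted.append(truth)      # always keep the target's own token
            if truth != guess:
                break                   # first mismatch: discard the rest
        # Simplification: the usual "bonus" token after a fully accepted
        # proposal is omitted.
        tokens.extend(accepted)
    return tokens, passes


def toy_target(prefix: List[int]) -> int:
    """Hypothetical target model: a fixed deterministic recurrence."""
    return (prefix[-1] * 3 + 1) % 11


def toy_draft(prefix: List[int]) -> int:
    """Hypothetical draft model: agrees with the target except at every 5th position."""
    t = toy_target(prefix)
    return t if len(prefix) % 5 else (t + 1) % 11


if __name__ == "__main__":
    out, passes = speculative_decode(toy_target, toy_draft, [1], num_new=12, k=4)
    print(out == greedy_decode(toy_target, [1], 12), passes)
```

Because the target's token is kept at every position, the output matches plain greedy decoding with the target model. The gain comes from needing fewer target passes than generated tokens whenever the draft model guesses correctly.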