Large language models achieve impressive performance across diverse tasks but suffer from high inference latency due to their large parameter counts. While quantization reduces model size, it often degrades performance relative to the full model. Speculative decoding remains lossless but typically incurs extra overhead. We propose SPEQ, an algorithm-hardware co-designed speculative decoding method that uses part of the full-model weight bits to form a quantized draft model, thereby eliminating additional training or storage overhead. A reconfigurable processing-element array enables efficient execution of both the draft and verification passes. Experimental results across 15 LLMs and tasks demonstrate that SPEQ achieves speedups of 2.07x, 1.53x, and 1.45x over FP16, Olive, and Tender, respectively.
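To make the mechanism concrete, the sketch below illustrates the two ideas stated above: deriving the draft model's weights from the high-order bits of the full model's stored weights (so no separate draft checkpoint is trained or stored), and the draft-then-verify loop of speculative decoding. This is a minimal sketch under stated assumptions, not SPEQ's exact algorithm: the 8-bit/4-bit widths, the function names, and the greedy, serial verification rule are all illustrative choices, and a real implementation would batch the verification into a single full-model forward pass.

```python
# Minimal sketch, assuming 8-bit stored weights and a 4-bit draft.
# Names, bit widths, and the greedy verification rule are illustrative
# assumptions, not the paper's exact implementation.
import numpy as np

def draft_weights_from_full(w_int8: np.ndarray, draft_bits: int = 4) -> np.ndarray:
    """Reuse the high-order `draft_bits` of each stored 8-bit weight as the
    draft model's weights; no extra weights are trained or stored."""
    shift = 8 - draft_bits
    # Arithmetic shift preserves the sign; the low-order bits are zeroed.
    return ((w_int8.astype(np.int32) >> shift) << shift).astype(np.int8)

def speculative_decode(draft_next, full_next, prompt, gamma=4, max_new=16):
    """Standard draft-then-verify loop, shown with greedy decoding.
    `draft_next`/`full_next` map a token sequence to the next token id."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # Draft pass: propose gamma tokens with the cheap low-bit model.
        proposal = []
        for _ in range(gamma):
            proposal.append(draft_next(tokens + proposal))
        # Verify pass: the full model checks each proposed token; the first
        # mismatch is corrected and the rest discarded, so the output matches
        # full-model greedy decoding exactly. (Shown serially here; in
        # practice all gamma positions are verified in one forward pass.)
        for i in range(gamma):
            expected = full_next(tokens)
            if proposal[i] == expected:
                tokens.append(proposal[i])
            else:
                tokens.append(expected)
                break
    return tokens

# Toy usage with stub "models" (token ids are plain ints here).
full = lambda seq: (sum(seq) + 1) % 100
draft = lambda seq: (sum(seq) + 1) % 100  # perfect drafter: all tokens accepted
print(speculative_decode(draft, full, [1, 2, 3], gamma=4, max_new=8))
```

Because the draft weights are a bit-slice of the full weights, both passes read the same stored tensor, which is what makes a single reconfigurable processing-element array able to serve the low-bit draft pass and the full-precision verification pass without duplicated weight memory.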