Large Language Models (LLMs) exhibit high inference latency because they decode autoregressively. While the draft head in speculative decoding mitigates this issue, its full potential remains unexplored. In this paper, we introduce KOALA (K-layer Optimized Adversarial Learning Architecture), an approach that is orthogonal to the design of the draft head. By transforming the conventional single-layer draft head into a multi-layer architecture and incorporating adversarial learning into the traditional supervised training, KOALA significantly improves the draft head's accuracy in predicting subsequent tokens, so that it more closely mirrors the behavior of the LLM. Although this improvement slightly increases drafting overhead, KOALA unlocks much more of the draft head's potential and substantially enhances speculative decoding. We evaluate KOALA comprehensively on both autoregressive and non-autoregressive draft heads across various tasks. KOALA improves the latency speedup ratio by 0.24x-0.41x, which is 10.57%-14.09% faster than the original draft heads.
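The abstract reports the gain in two forms: an absolute increase in the speedup ratio and a relative percentage. The minimal sketch below shows how the two are usually related, assuming the speedup ratio is defined as vanilla decoding latency divided by speculative decoding latency. The function name `relative_gain` and the numbers 2.0x and 2.3x are hypothetical illustrations, not results from the paper.

```python
def relative_gain(baseline_speedup: float, improved_speedup: float) -> float:
    """Return the fractional speed gain of the improved method over the baseline.

    Both arguments are speedup ratios measured against the same vanilla
    autoregressive decoding latency (an assumption of this sketch).
    """
    if baseline_speedup <= 0:
        raise ValueError("baseline_speedup must be positive")
    return (improved_speedup - baseline_speedup) / baseline_speedup


# Hypothetical values: a draft head at 2.0x that reaches 2.3x after a change.
absolute_gain = 2.3 - 2.0            # absolute increase in the speedup ratio
gain = relative_gain(2.0, 2.3)       # the same increase relative to the baseline
print(f"absolute: {absolute_gain:.2f}x, relative: {gain:.2%}")
```

Under this definition, the same absolute gain corresponds to a smaller relative gain when the baseline draft head is already fast, which may be why an abstract would report both figures.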