Existing speculative decoding methods typically require additional model structures and training processes to assist the model in draft token generation, which makes migrating these acceleration methods to new models more costly and more demanding of device memory. To address this problem, we propose the Make Some Noise (MSN) training framework as a replacement for the supervised fine-tuning stage of large language models. The training method simply introduces some noise at the input for the model to learn the denoising task. It significantly enhances the parallel decoding capability of the model without affecting the original task capability. In addition, we propose a tree-based retrieval-augmented Jacobi (TR-Jacobi) decoding strategy to further improve the inference speed of MSN models. Experiments in both the general and code domains show that MSN can improve inference speed by 2.3-2.7x without compromising model performance. On Spec-Bench, the MSN model also achieves acceleration ratios comparable to SOTA models that rely on additional model structures.
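The denoising objective described above can be sketched as follows. This is a minimal illustration of the noise-injection idea only: the function name, the tail-corruption scheme, the noise span length, and the uniform noise distribution are all assumptions for exposition, not the paper's exact recipe.

```python
import random

def make_noisy_sample(token_ids, noise_len=5, vocab_size=32000, seed=0):
    """Build one denoising training pair (hypothetical helper).

    The input sequence has its last `noise_len` tokens replaced by
    random tokens drawn from the vocabulary, while the labels keep the
    original tokens. Training with a standard causal LM loss on such
    pairs teaches the model to recover correct tokens from a corrupted
    context, which is the capability Jacobi-style parallel decoding
    relies on at inference time.
    """
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = list(token_ids)
    start = max(0, len(inputs) - noise_len)
    for i in range(start, len(inputs)):
        inputs[i] = rng.randrange(vocab_size)  # corrupt the tail span
    return inputs, labels

# Only the last two positions of the input are corrupted; the labels
# are unchanged, so the supervision signal is the original sequence.
inputs, labels = make_noisy_sample([11, 22, 33, 44, 55, 66],
                                   noise_len=2, vocab_size=100)
```

Because the corruption touches only the input and not the labels, this can be layered onto an ordinary supervised fine-tuning pipeline without extra model parameters, which is the point of the framework.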