Autoregressive language models demonstrate excellent performance across a wide range of scenarios. However, their inference efficiency is limited by the one-token-per-step generation mode, a problem that has become increasingly pressing as models grow larger. Speculative decoding employs a "draft and then verify" mechanism that allows multiple tokens to be generated in a single step, achieving lossless acceleration. Existing methods mainly adopt fixed heuristic draft structures, which fail to adapt to different situations and thus cannot maximize the acceptance length during verification. To alleviate this dilemma, we propose OPT-Tree, an algorithm for constructing adaptive and scalable draft trees. At each decoding step, it searches for the tree structure that maximizes the mathematical expectation of the acceptance length. Experimental results show that OPT-Tree outperforms existing draft structures and achieves a speed-up ratio of up to 3.2 over autoregressive decoding. When the draft model is sufficiently powerful and the node budget is large enough, it can generate more than ten tokens in a single step. Our code is available at https://github.com/Jikai0Wang/OPT-Tree.
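The idea of maximizing the expected acceptance length can be sketched as a greedy tree search: if a node's acceptance probability is approximated by the product of draft-model probabilities along its path, the expected acceptance length is the sum of these path probabilities over all tree nodes, so the tree is grown by repeatedly expanding the most probable frontier node until the node budget is exhausted. The following is a minimal, hypothetical sketch (the names `build_opt_tree` and `expand_fn` are illustrative, not from the paper's implementation):

```python
import heapq
import itertools

def build_opt_tree(root_probs, expand_fn, budget):
    """Greedily grow a draft tree that maximizes the (approximate)
    expected acceptance length: the sum, over all nodes, of the
    product of draft probabilities along the path from the root.

    root_probs: list of (token, prob) candidates for the first draft step
    expand_fn:  hypothetical callback (path) -> list of (token, prob)
                giving the draft model's top continuations of `path`
    budget:     maximum number of nodes in the tree
    """
    counter = itertools.count()  # tie-breaker so paths are never compared
    heap = []                    # max-heap via negated path probability
    for tok, p in root_probs:
        heapq.heappush(heap, (-p, next(counter), (tok,)))

    tree, expected_len = [], 0.0
    while heap and len(tree) < budget:
        neg_p, _, path = heapq.heappop(heap)
        path_prob = -neg_p
        tree.append(path)
        expected_len += path_prob  # each accepted node contributes one token
        for tok, p in expand_fn(path):
            heapq.heappush(heap, (-path_prob * p, next(counter), path + (tok,)))
    return tree, expected_len
```

Because path probabilities only shrink with depth, the greedy choice of the highest-probability frontier node at each expansion yields the budget-constrained tree with the largest sum of path probabilities, i.e. the largest expected acceptance length under this approximation.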