Autoregressive sampling from large language models has led to state-of-the-art results in several natural language tasks. However, autoregressive sampling generates tokens one at a time, which makes it slow and even prohibitive for certain tasks. One way to speed up sampling is $\textit{speculative decoding}$: use a small model to sample a $\textit{draft}$ (a block or sequence of tokens), and then score all tokens in the draft with the large language model in parallel. A subset of the tokens in the draft is accepted (and the rest rejected) based on a statistical method that guarantees that the final output follows the distribution of the large model. In this work, we provide a principled understanding of speculative decoding through the lens of optimal transport (OT) with $\textit{membership cost}$. This framework can be viewed as an extension of the well-known $\textit{maximal-coupling}$ problem. This new formulation enables us to generalize the speculative decoding method to allow for a set of $k$ candidates at the token level, which leads to an improved optimal membership cost. We show that the optimal draft selection algorithm (transport plan) can be computed via linear programming, whose best-known runtime is exponential in $k$. We then propose a valid draft selection algorithm whose acceptance probability is $(1-1/e)$-optimal multiplicatively. Moreover, it can be computed in time almost linear in the size of the domain of a single token. Using this $\textit{new draft selection}$ algorithm, we develop a new autoregressive sampling algorithm called $\textit{SpecTr}$, which provides a speedup in decoding while ensuring that there is no quality degradation in the decoded output. We experimentally demonstrate that for state-of-the-art large language models, the proposed approach achieves a wall-clock speedup of 2.13X, a further 1.37X speedup over speculative decoding on standard benchmarks.
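To make the accept/reject step concrete, the following is a minimal sketch of the standard single-draft ($k = 1$) token-level rule, which corresponds to the maximal-coupling case mentioned above. It is not the $k$-candidate draft selection algorithm proposed in this work. The function names `speculative_accept` and `acceptance_probability`, and the toy distributions in the usage note, are assumptions made for illustration only.

```python
import numpy as np


def speculative_accept(p_large, q_small, draft_token, rng):
    """Accept or replace one drafted token (single-draft case, k = 1).

    p_large: next-token distribution of the large model (1-D, sums to 1).
    q_small: next-token distribution of the small model (1-D, sums to 1).
    draft_token: index sampled from q_small.
    Returns (token, accepted); the returned token is distributed as p_large.
    """
    p = np.asarray(p_large, dtype=float)
    q = np.asarray(q_small, dtype=float)
    # Accept the drafted token with probability min(1, p(x) / q(x)).
    if rng.random() < min(1.0, p[draft_token] / q[draft_token]):
        return draft_token, True
    # On rejection, resample from the normalized residual max(p - q, 0).
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p), p=residual)), False


def acceptance_probability(p_large, q_small):
    """Overall acceptance probability: sum_x min(p(x), q(x))."""
    return float(np.minimum(p_large, q_small).sum())
```

For example, with the hypothetical distributions `p = [0.5, 0.3, 0.2]` and `q = [0.2, 0.3, 0.5]`, `acceptance_probability(p, q)` evaluates to $0.2 + 0.3 + 0.2 = 0.7$, which equals one minus the total variation distance between the two distributions. Tokens returned by `speculative_accept` follow `p` regardless of how far `q` is from it; only the acceptance rate changes.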