Many applications today provide users with multiple auto-complete drafts as they type, including GitHub's code completion, Gmail's smart compose, and Apple's messaging auto-suggestions. Under the hood, language models support this by running an autoregressive inference pass to produce each draft. Consequently, providing $k$ drafts to the user requires running an expensive language model $k$ times. To alleviate the cost of running $k$ inference passes, we propose Superposed Decoding, a new decoding algorithm that generates $k$ drafts at the computational cost of one autoregressive inference pass. We achieve this by feeding a superposition of the most recent token embeddings from the $k$ drafts as input to the next decoding step of the language model. At every inference step, we combine the $k$ drafts with the top-$k$ tokens to get $k^2$ new drafts and cache the $k$ most likely options, using n-gram interpolation with minimal compute overhead to filter out incoherent generations. Our experiments show that the $k$ drafts from Superposed Decoding are at least as coherent and factual as those from Nucleus Sampling and Greedy Decoding respectively, while being at least $2.44\times$ faster for $k\ge3$. In a compute-normalized setting, users in our evaluations notably prefer text generated by Superposed Decoding over Nucleus Sampling. Code and more examples are open-sourced at https://github.com/RAIVNLab/SuperposedDecoding.
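To make the mechanism concrete, below is a minimal sketch of one Superposed Decoding step in PyTorch. It assumes a model that maps input embeddings to hidden states and exposes an embedding table and an `lm_head`; names such as `superposed_step`, `weights`, and `ngram_logprob` are illustrative placeholders rather than the open-sourced API, and the n-gram scorer and KV caching are stubbed out.

```python
# Minimal sketch of one Superposed Decoding step (illustrative, not the
# authors' implementation). Assumes `model` maps input embeddings to
# hidden states, `embed` is the token embedding table, and `lm_head`
# projects hidden states to vocabulary logits.
import torch


def ngram_logprob(draft: list[int], token: int) -> float:
    # Placeholder for the paper's n-gram interpolation; a real version
    # would look up n-gram statistics for `draft + [token]`.
    return 0.0


@torch.no_grad()
def superposed_step(model, embed, lm_head, drafts, scores, weights, alpha=0.5):
    """Advance k drafts by one token with a single forward pass.

    drafts:  list of k token-id lists sharing a common prefix.
    scores:  tensor of k cumulative log-probabilities.
    weights: tensor of k mixing coefficients (summing to 1) used to
             superpose the drafts' most recent token embeddings.
    alpha:   interpolation weight between LM and n-gram scores
             (hypothetical parameter for this sketch).
    """
    k = len(drafts)
    # Superpose the k most recent token embeddings into one input embedding.
    last_tokens = torch.tensor([d[-1] for d in drafts])
    superposed = (weights.unsqueeze(-1) * embed(last_tokens)).sum(dim=0)
    # One autoregressive pass on the superposed embedding; a full
    # implementation would reuse the shared prefix's KV cache here.
    hidden = model(superposed.unsqueeze(0).unsqueeze(0))          # (1, 1, d)
    logprobs = torch.log_softmax(lm_head(hidden)[0, -1], dim=-1)  # (vocab,)

    # Expand: k drafts x top-k tokens -> k^2 candidates.
    top_logprobs, top_tokens = logprobs.topk(k)
    candidates = []
    for i, draft in enumerate(drafts):
        for j in range(k):
            tok = top_tokens[j].item()
            # Interpolate LM and n-gram scores to filter incoherent drafts.
            step_score = (alpha * top_logprobs[j].item()
                          + (1 - alpha) * ngram_logprob(draft, tok))
            candidates.append((scores[i].item() + step_score, draft + [tok]))

    # Cache only the k most likely of the k^2 candidates.
    candidates.sort(key=lambda c: c[0], reverse=True)
    new_scores = torch.tensor([c[0] for c in candidates[:k]])
    new_drafts = [c[1] for c in candidates[:k]]
    return new_drafts, new_scores
```

A full implementation would also renormalize the superposition weights from the updated draft scores between steps, so that each decoding step still costs a single forward pass regardless of $k$.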