We propose LLMA, an LLM accelerator that losslessly speeds up Large Language Model (LLM) inference with references. LLMA is motivated by the observation that an LLM's decoding output often shares abundant identical text spans with a reference that is available in many real-world scenarios (e.g., retrieved documents). LLMA first selects a text span from the reference and copies its tokens to the decoder, then efficiently checks in parallel, within a single decoding step, whether those tokens are appropriate as the decoding result. The improved computational parallelism allows LLMA to achieve over 2x speed-up for LLMs while producing the same generation results as greedy decoding in many practical generation scenarios where the in-context reference and the outputs overlap significantly (e.g., search engines and multi-turn conversations).
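The copy-then-verify loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy bigram `TABLE`, the `toy_next_token` function, and the `match_len` and `copy_len` parameters are all hypothetical stand-ins, and the per-position calls inside one step only simulate what a real transformer would compute in a single parallel forward pass.

```python
from typing import Callable, List, Tuple

Token = str
# Hypothetical stand-in for an LLM: maps a context to its greedy next token.
NextToken = Callable[[List[Token]], Token]


def find_span(reference: List[Token], output: List[Token],
              match_len: int, copy_len: int) -> List[Token]:
    """Return up to copy_len reference tokens that follow the first place
    where the last match_len output tokens occur (match_len >= 1 assumed)."""
    if len(output) < match_len:
        return []
    suffix = output[-match_len:]
    for i in range(len(reference) - match_len + 1):
        if reference[i:i + match_len] == suffix:
            return reference[i + match_len:i + match_len + copy_len]
    return []


def greedy_decode(next_token: NextToken, prompt: List[Token],
                  eos: Token, max_new: int) -> List[Token]:
    """Plain greedy decoding: one token per step."""
    output: List[Token] = []
    while len(output) < max_new:
        tok = next_token(prompt + output)
        output.append(tok)
        if tok == eos:
            break
    return output


def llma_decode(next_token: NextToken, prompt: List[Token],
                reference: List[Token], eos: Token, max_new: int,
                match_len: int = 1, copy_len: int = 3) -> Tuple[List[Token], int]:
    """Copy a span from the reference, then keep only the tokens that
    greedy decoding would have produced anyway. Returns (output, steps)."""
    output: List[Token] = []
    steps = 0
    while len(output) < max_new:
        draft = find_span(reference, output, match_len, copy_len)
        context = prompt + output
        # One decoding step: greedy predictions after every draft prefix.
        # A real model would obtain all of these from one forward pass.
        preds = [next_token(context + draft[:i]) for i in range(len(draft) + 1)]
        steps += 1
        accepted: List[Token] = []
        for i, pred in enumerate(preds):
            accepted.append(pred)  # pred is valid: all earlier draft tokens matched
            if pred == eos or i >= len(draft) or draft[i] != pred:
                break
        output.extend(accepted)
        if eos in accepted:
            break
    return output[:max_new], steps


# Toy deterministic "model": the next token depends only on the last token.
TABLE = {"Q": "a", "a": "b", "b": "c", "c": "d", "d": "e", "e": "f"}


def toy_next_token(context: List[Token]) -> Token:
    return TABLE.get(context[-1], "<eos>")


prompt = ["Q"]
reference = ["x", "a", "b", "c", "d", "e", "y"]
baseline = greedy_decode(toy_next_token, prompt, "<eos>", max_new=20)
fast, steps = llma_decode(toy_next_token, prompt, reference, "<eos>", max_new=20)
print(fast == baseline, len(baseline), steps)
```

Because a drafted token is kept only when it equals the model's own greedy prediction at that position, the output matches plain greedy decoding by construction. Any saving comes from accepting several tokens per step where the reference overlaps the output.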