Large language models (LLMs) have become ubiquitous in practice and are widely used for generation tasks such as translation, summarization, and instruction following. However, their enormous size and reliance on autoregressive decoding increase deployment costs and complicate their use in latency-critical applications. In this work, we propose a hybrid approach that combines language models of different sizes to increase the efficiency of autoregressive decoding while maintaining high performance. Our method employs a pretrained frozen LLM that encodes all prompt tokens once in parallel, and uses the resulting representations to condition and guide a small language model (SLM), which then generates the response more efficiently. We investigate combinations of encoder-decoder LLMs with both encoder-decoder and decoder-only SLMs from different model families, requiring only fine-tuning of the SLM. Experiments on various benchmarks show substantial speedups of up to $4\times$, with minor performance penalties of $1-2\%$ on translation and summarization tasks compared to the LLM.
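The conditioning scheme described above can be illustrated with a minimal sketch: a frozen LLM encoder runs once over the prompt, and a small decoder cross-attends to those frozen states at each autoregressive step. All names, dimensions, and the projection-based bridge between the two models are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical sketch of the hybrid decoding scheme: a frozen LLM encodes the
# prompt once in parallel; a small LM (SLM) generates autoregressively while
# cross-attending to the frozen representations. Dimensions and the attention
# bridge are assumptions for illustration only.

rng = np.random.default_rng(0)
D_LLM, D_SLM, VOCAB = 64, 32, 100

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Stand-in for the frozen LLM encoder: runs once over all prompt tokens.
LLM_EMB = rng.standard_normal((VOCAB, D_LLM))

def llm_encode(prompt_ids):
    return LLM_EMB[prompt_ids]              # (T_prompt, D_LLM), computed once

# Trainable SLM parameters (the only part that would be fine-tuned).
W_q = rng.standard_normal((D_SLM, D_SLM)) * 0.1
W_k = rng.standard_normal((D_LLM, D_SLM)) * 0.1
W_v = rng.standard_normal((D_LLM, D_SLM)) * 0.1
W_out = rng.standard_normal((D_SLM, VOCAB)) * 0.1
SLM_EMB = rng.standard_normal((VOCAB, D_SLM)) * 0.1

def slm_step(tok_id, llm_states):
    """One cheap decoding step: SLM state cross-attends to frozen LLM states."""
    h = SLM_EMB[tok_id]                     # (D_SLM,)
    q = h @ W_q                             # query from the SLM
    k = llm_states @ W_k                    # keys/values projected from LLM
    v = llm_states @ W_v
    attn = softmax(q @ k.T / np.sqrt(D_SLM))
    ctx = attn @ v                          # prompt context injected into SLM
    logits = (h + ctx) @ W_out
    return int(np.argmax(logits))           # greedy decoding for simplicity

def generate(prompt_ids, max_new=5, bos=0):
    llm_states = llm_encode(np.array(prompt_ids))  # single parallel LLM pass
    out, tok = [], bos
    for _ in range(max_new):                # only the SLM runs per step
        tok = slm_step(tok, llm_states)
        out.append(tok)
    return out
```

The key efficiency property is visible in `generate`: the expensive model contributes one parallel encoding pass, while every autoregressive step touches only the small model's weights.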