Contemporary large language models (LLMs) primarily rely on the next-token prediction method for inference, which significantly impedes their processing speed. In this paper, we introduce a novel inference methodology termed next-sentence prediction, aimed at enhancing the inference efficiency of LLMs. We present the Sentence Variational Autoencoder (SentenceVAE), a tiny model consisting of a Sentence Encoder and a Sentence Decoder. The encoder condenses the information within a sentence into a single token, while the decoder reconstructs this compressed representation back into its original sentential form. By integrating SentenceVAE into the input and output layers of LLMs, we develop Sentence-level LLMs (SLLMs) that employ a sentence-by-sentence inference approach, markedly accelerating inference. Because it segments text at sentence boundaries, SentenceVAE also preserves the original semantic content, improving accuracy while boosting inference speed. Compared to published LLMs, SLLMs process fewer tokens over equivalent context lengths, significantly reducing the memory demands of self-attention computation and facilitating the handling of longer contexts. Our experiments show that, relative to the token-by-token method, this approach accelerates inference by 204–365%, reduces perplexity (PPL) to 46–75% of its original value, and cuts memory overhead by 86–91% at the same context length. Moreover, the benefits become even more pronounced as model size increases.
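To make the encode-compress-decode flow concrete, the following is a minimal, purely illustrative sketch of the idea, not the authors' architecture: the class name, the mean-pooling encoder, and the per-position linear decoder are all simplifying assumptions introduced here for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

class ToySentenceVAE:
    """Illustrative stand-in for SentenceVAE (hypothetical, simplified):
    compresses a sentence's token embeddings into one vector and
    reconstructs per-position token predictions from that vector."""

    def __init__(self, vocab_size=100, dim=16, max_len=8):
        self.embed = rng.normal(size=(vocab_size, dim))         # token embedding table
        self.dec = rng.normal(size=(max_len, dim, vocab_size))  # per-position decoder weights
        self.max_len = max_len

    def encode(self, token_ids):
        # Condense a whole sentence into a single vector (the "sentence token")
        return self.embed[token_ids].mean(axis=0)

    def decode(self, sentence_vec, length):
        # Expand the sentence vector back into one token id per position
        logits = np.einsum("d,ldv->lv", sentence_vec, self.dec[:length])
        return logits.argmax(axis=-1)

vae = ToySentenceVAE()
ids = [3, 17, 42, 8]                 # a 4-token "sentence"
z = vae.encode(ids)                  # one vector now represents the sentence
out = vae.decode(z, len(ids))        # decoder maps it back to 4 token ids
print(z.shape, out.shape)
```

In the paper's setting, the downstream LLM would operate on sequences of such sentence vectors rather than on raw tokens, which is where the reduction in sequence length (and thus self-attention memory) comes from.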