Contemporary large language models (LLMs) predominantly rely on next-token prediction for inference, which significantly limits their processing speed. In this paper, we introduce a novel inference method, termed next-sentence prediction, aimed at improving the inference efficiency of LLMs. We present SentenceVAE, a compact model consisting of an encoder and a decoder: the encoder condenses the information in a sentence into a single token, while the decoder reconstructs this compressed representation back into the original sentence. By integrating SentenceVAE into the input and output layers of LLMs, we obtain Sentence-level LLMs (SLLMs) that perform inference sentence by sentence, markedly accelerating inference. Because SentenceVAE segments text at sentence boundaries, it preserves the original semantic content, maintaining accuracy while boosting inference speed. Compared with traditional LLMs, SLLMs process fewer tokens for the same context length, significantly reducing the memory required for self-attention computation and enabling longer contexts. Our experiments show that this method increases inference speed by 204–365%, reduces perplexity (PPL) to 46–75% of its original value, and cuts memory overhead by 86–91% at the same context length. These advantages grow further as model parameters increase.
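The encode/decode interface described above can be sketched minimally as follows. This is an illustrative NumPy toy, not the paper's architecture: the dimensions, mean pooling, and linear projections are all assumptions standing in for the trained SentenceVAE components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration only.
VOCAB, D_MODEL, MAX_LEN = 100, 16, 8

# Random matrices standing in for trained encoder/decoder parameters.
embed = rng.normal(size=(VOCAB, D_MODEL))
W_enc = rng.normal(size=(D_MODEL, D_MODEL))
W_dec = rng.normal(size=(D_MODEL, MAX_LEN * D_MODEL))
W_out = rng.normal(size=(D_MODEL, VOCAB))

def encode_sentence(token_ids):
    """Compress a whole sentence into one sentence-level vector."""
    h = embed[token_ids]        # (sentence_len, d_model) token embeddings
    pooled = h.mean(axis=0)     # crude pooling stands in for the encoder
    return pooled @ W_enc       # (d_model,) -- one "token" per sentence

def decode_sentence(sent_vec):
    """Expand a sentence vector back into per-position token logits."""
    h = (sent_vec @ W_dec).reshape(MAX_LEN, D_MODEL)
    return h @ W_out            # (max_len, vocab) reconstruction logits

sentence = [5, 17, 42, 9]
z = encode_sentence(sentence)
logits = decode_sentence(z)
print(z.shape, logits.shape)    # (16,) (8, 100)
```

The key property this sketch mirrors is the interface: a variable-length sentence maps to a single fixed-size vector, so the outer LLM attends over one embedding per sentence rather than one per token, which is where the memory and speed savings come from.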