Current large language models (LLMs) rely primarily on next-token prediction for inference, which significantly limits their processing speed. In this paper, we introduce a novel inference method, next-sentence prediction, aimed at improving the inference efficiency of LLMs. We present the Sentence Variational Autoencoder (SentenceVAE), which comprises a Sentence Encoder that compresses the multiple tokens of a sentence into a single token and a Sentence Decoder that reconstructs them. By integrating SentenceVAE into the input and output layers of LLMs, we develop Sentence-level LLMs (SLLMs) that infer sentence by sentence. In addition, the SentenceVAE module preserves the integrity of the original semantic content by segmenting the context into sentences, improving accuracy while boosting inference speed. Moreover, compared to conventional LLMs, SLLMs process fewer tokens for the same context length, significantly reducing the memory demands of self-attention computation and facilitating the handling of longer contexts. Extensive experiments on the Wanjuan dataset show that, compared to previous token-by-token methods, the proposed approach accelerates inference by 204–365%, reduces perplexity (PPL) to 46–75% of the original value, and cuts memory overhead by 86–91% for the same context length.
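The memory claim above can be illustrated with a crude back-of-envelope model: self-attention memory scales roughly as O(n²) in the number of tokens attended over, so compressing every sentence of ~k tokens into one embedding shrinks that cost by about k². The sketch below is an illustration under that assumption only; the compression factor `k` is a hypothetical parameter, not a number reported in the paper.

```python
# Back-of-envelope estimate: self-attention memory scales as O(n^2) in the
# sequence length n. If a SentenceVAE-style encoder replaces each sentence of
# roughly k tokens with a single sentence embedding, the sequence the LLM
# attends over shrinks by a factor of k, so attention memory shrinks by ~k^2.
# NOTE: k is an illustrative assumption, not a value taken from the paper.

def attention_memory_reduction(avg_tokens_per_sentence: int) -> float:
    """Fraction of self-attention memory saved when each sentence of
    `avg_tokens_per_sentence` tokens is replaced by one embedding."""
    k = avg_tokens_per_sentence
    return 1.0 - 1.0 / (k * k)

if __name__ == "__main__":
    # Under this crude model, a compression factor of 3 already yields
    # savings on the order of the 86-91% reduction the abstract reports.
    print(f"{attention_memory_reduction(3):.3f}")  # 0.889
```

This is only a sanity check on orders of magnitude; the actual savings depend on tokenizer, sentence-length distribution, and implementation details not modeled here.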