Current large language models (LLMs) primarily rely on next-token prediction for inference, which significantly limits their processing speed. In this paper, we introduce a novel inference method termed next-sentence prediction, aimed at improving the inference efficiency of LLMs. We present the Sentence Variational Autoencoder (SentenceVAE), a tiny model consisting of a Sentence Encoder and a Sentence Decoder. The Sentence Encoder condenses the information within a sentence into a single token, while the Sentence Decoder reconstructs this compressed token back into a sentence. By integrating SentenceVAE into the input and output layers of LLMs, we develop Sentence-level LLMs (SLLMs) that perform inference sentence by sentence. In addition, the SentenceVAE module of SLLMs preserves the integrity of the original semantic content by segmenting the context into sentences, thereby improving accuracy while boosting inference speed. Moreover, compared to previous LLMs, SLLMs process fewer tokens over an equivalent context length, significantly reducing the memory demands of self-attention computation and facilitating the handling of longer contexts. Extensive experiments on the Wanjuan dataset reveal that, compared to the token-by-token method, the proposed approach accelerates inference by 204~365%, reduces perplexity (PPL) to 46~75% of its original value, and decreases memory overhead by 86~91% at equivalent context lengths.
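To make the encode/decode mechanism concrete, the following is a minimal toy sketch of the SentenceVAE idea, not the paper's actual implementation: all shapes, pooling choices, and weight names here are illustrative assumptions. An encoder pools a sentence's token embeddings into a single latent "sentence token" via the standard VAE reparameterization, and a decoder expands that one vector back into per-token embeddings.

```python
# Hypothetical sketch of a SentenceVAE-style encoder/decoder (toy weights, not the paper's code).
import numpy as np

rng = np.random.default_rng(0)
d_model, n_tokens = 8, 5  # embedding width and sentence length (toy values)

# Random matrices standing in for learned parameters.
W_mu = rng.normal(size=(d_model, d_model))      # projects pooled sentence to mean
W_logvar = rng.normal(size=(d_model, d_model))  # projects pooled sentence to log-variance
W_dec = rng.normal(size=(d_model, n_tokens * d_model))  # expands latent back to tokens

def encode(token_embs: np.ndarray) -> np.ndarray:
    """Compress a (n_tokens, d_model) sentence into one d_model latent vector."""
    pooled = token_embs.mean(axis=0)                # mean-pool the sentence (one choice of many)
    mu = pooled @ W_mu                              # variational mean
    logvar = pooled @ W_logvar                      # variational log-variance
    # Reparameterization trick: sample z = mu + sigma * eps.
    return mu + np.exp(0.5 * logvar) * rng.normal(size=d_model)

def decode(z: np.ndarray) -> np.ndarray:
    """Reconstruct (n_tokens, d_model) token embeddings from the sentence token."""
    return (z @ W_dec).reshape(n_tokens, d_model)

sentence = rng.normal(size=(n_tokens, d_model))
z = encode(sentence)    # the whole sentence becomes a single vector ...
recon = decode(z)       # ... that the decoder expands back into token embeddings
print(z.shape, recon.shape)
```

The key efficiency property follows from the shapes: the backbone LLM attends over one latent vector per sentence rather than one per token, so a context of S sentences costs O(S^2) self-attention instead of O(T^2) over T tokens.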