Deploying large language models for inference remains challenging because of their high computational overhead. Early exiting accelerates inference by adaptively reducing the number of layers executed. Existing methods train internal classifiers to decide whether to exit at each intermediate layer, and designing and training these classifiers takes significant effort. To address this limitation, this paper proposes RAEE, a training-free Retrieval-Augmented Early Exiting framework for efficient inference. First, the paper shows that the early exiting problem can be modeled as a distribution prediction problem, in which the exit distribution is approximated from the exiting information of similar data. Next, it details how this exiting information is collected to build the retrieval database. Finally, using the pre-built database, RAEE retrieves the exiting information of similar data and uses it to guide the backbone model to exit at the layer predicted by the approximated distribution. Experimental results demonstrate that RAEE significantly accelerates inference and achieves state-of-the-art zero-shot performance on 8 classification tasks.
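The retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the embedding vectors, the cosine similarity measure, the top-k retrieval, the similarity-weighted voting, and all function names (`build_database`, `predict_exit_layer`) are assumptions chosen for illustration. The sketch only shows the general idea of approximating an exit-layer distribution from the recorded exit layers of similar examples, without training any classifier.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_database(embeddings, exit_layers):
    """Hypothetical offline step: pair each example's embedding with the
    layer at which that example was observed to be able to exit."""
    return list(zip(embeddings, exit_layers))


def predict_exit_layer(query, database, k=3, fallback_layer=None):
    """Approximate an exit-layer distribution from the k most similar
    database entries and return (most probable layer, distribution)."""
    # Retrieve the k entries most similar to the query embedding.
    scored = sorted(
        ((cosine(query, emb), layer) for emb, layer in database),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]

    # Assumed weighting: each neighbor votes for its exit layer,
    # weighted by its non-negative similarity to the query.
    votes = defaultdict(float)
    for sim, layer in scored:
        votes[layer] += max(sim, 0.0)

    total = sum(votes.values())
    if total == 0:
        # No usable neighbors: fall back (e.g. to the final layer).
        return fallback_layer, {}

    dist = {layer: weight / total for layer, weight in votes.items()}
    best = max(dist, key=dist.get)
    return best, dist


# Toy data: 2-d embeddings with made-up exit layers.
database = build_database(
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]],
    [4, 4, 10, 12],
)
layer, dist = predict_exit_layer([1.0, 0.05], database, k=3, fallback_layer=12)
print(layer, dist)
```

In a real system the backbone model would run only up to the predicted layer and feed that layer's hidden state to its prediction head. How the exit layers in the database are determined is described in the paper itself and is not reproduced here.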