This paper presents a novel technique for accelerating inference in large pre-trained language models (LLMs) by introducing early exits during inference. The computational demands of these models, which are used across a wide range of applications, can be substantial. By capitalizing on the inherent variability in token complexity, our approach selectively accelerates the inference process. Specifically, we propose integrating early-exit "heads" atop existing transformer layers, which enable conditional termination based on a confidence metric. These heads are trained in a self-supervised manner using the model's own predictions as training data, eliminating the need for additional annotated data. A confidence threshold, established on a calibration set, ensures a desired level of accuracy while allowing termination as soon as confidence exceeds it. Notably, our method preserves the original accuracy while reducing computation time on certain tasks, leveraging the existing knowledge of pre-trained LLMs without requiring extensive retraining. This lightweight, modular modification has the potential to greatly enhance the practical usability of LLMs, particularly in applications such as real-time language processing in resource-constrained environments.
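To make the mechanism concrete, the following PyTorch sketch illustrates one plausible realization of the approach: lightweight exit heads attached to intermediate transformer layers, a confidence check against a calibrated threshold at inference time, and a self-supervised loss that trains each head on the frozen model's own final-layer predictions. The names `EarlyExitHead`, `forward_with_early_exit`, and `self_supervised_head_loss`, as well as the choice of max-softmax probability as the confidence metric, are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EarlyExitHead(nn.Module):
    """Hypothetical exit head: a single linear classifier over a layer's hidden states."""
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, vocab_size)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, hidden_size) -> logits: (batch, seq, vocab_size)
        return self.proj(hidden)

def self_supervised_head_loss(head_logits: torch.Tensor,
                              final_logits: torch.Tensor) -> torch.Tensor:
    """Train a head on the frozen model's own predictions (no annotated data):
    cross-entropy against the argmax token of the final layer's distribution."""
    targets = final_logits.argmax(dim=-1)
    return F.cross_entropy(head_logits.flatten(0, 1), targets.flatten())

@torch.no_grad()
def forward_with_early_exit(layers, heads, x, threshold=0.9):
    """Run transformer layers in order; after each layer that has an attached head,
    exit early if the head's max-softmax confidence clears the calibrated threshold.
    `heads` maps layer index -> EarlyExitHead; the last layer's head stands in for
    the model's original LM head, so a prediction is always produced."""
    for i, layer in enumerate(layers):
        x = layer(x)
        head = heads.get(i)
        if head is None:
            continue
        logits = head(x)
        confidence = logits.softmax(dim=-1).max(dim=-1).values
        if i < len(layers) - 1 and bool((confidence >= threshold).all()):
            return logits, i  # every position is confident: skip remaining layers
    return logits, len(layers) - 1  # fell through to the final layer
```

In this reading, the threshold would be chosen on the calibration set as the smallest value at which early-exited predictions agree with the full model at the desired accuracy level, and the heads would be trained with `self_supervised_head_loss` while all original model weights remain frozen.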