BEExformer: A Fast Inferencing Binarized Transformer with Early Exits

From arXiv: This revised manuscript includes 18 pages, 6 figures, and 6 tables. The methodology and results sections have been improved for clarity and depth, incorporating additional comparisons, ablations, and new evaluation datasets. A few relevant references were added, and the overall organization was refined for better readability.

Large Language Models (LLMs) based on transformers achieve cutting-edge results on a variety of applications. However, their enormous size and processing requirements hinder deployment in resource-constrained settings. Binarization and Early Exit (EE) have proved to be effective ways to improve efficiency. Binarization, however, may lead to performance loss, because reduced precision affects gradient estimation and parameter updates, and research on EE mechanisms is still in its early stages. To address these challenges, we introduce the Binarized Early Exit Transformer (BEExformer), a first-of-its-kind selective learning-based transformer that integrates Binarization-Aware Training (BAT) with EE for efficient and fast textual inference. Each transformer block has an integrated Selective-Learn Forget Network (SLFN) to enhance contextual retention while eliminating irrelevant information. The BAT employs a differentiable second-order approximation to the sign function, enabling gradient computation that captures both the sign and magnitude of the weights. Binarization yields a 21.30-fold reduction in model size. The EE mechanism hinges on the fractional reduction in entropy between intermediate transformer blocks, combined with soft-routing loss estimation. This accelerates inference by reducing FLOPs by 52.27% and even improves accuracy by 3.22% by resolving the "overthinking" problem inherent in deep networks. Extensive evaluation, comprising comparisons with SOTA methods and various ablations across nine datasets covering multiple NLP tasks, demonstrates its Pareto-optimal performance-efficiency trade-off.
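The two mechanisms named in the abstract can be illustrated with a minimal Python sketch. It rests on assumptions the abstract does not confirm: the piecewise quadratic below is the well-known second-order sign approximation from earlier binarized-network work, used here only as a stand-in for whatever form BEExformer adopts, and the exit rule assumes the model stops once the relative drop in entropy between consecutive blocks falls below a hypothetical `threshold`. All function names are illustrative, and soft-routing loss estimation is not modelled.

```python
import math


def approx_sign(x: float) -> float:
    """Piecewise second-order polynomial approximation to sign(x).

    Assumption: a generic piecewise quadratic, not necessarily the
    exact function used in the paper.
    """
    if x < -1.0:
        return -1.0
    if x < 0.0:
        return 2.0 * x + x * x
    if x < 1.0:
        return 2.0 * x - x * x
    return 1.0


def approx_sign_grad(x: float) -> float:
    """Derivative of approx_sign: depends on the magnitude of x inside
    [-1, 1) and is zero outside that interval."""
    if x < -1.0 or x >= 1.0:
        return 0.0
    return 2.0 + 2.0 * x if x < 0.0 else 2.0 - 2.0 * x


def entropy(probs) -> float:
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def fractional_reduction(prev_entropy: float, curr_entropy: float) -> float:
    """Relative drop in entropy from one block to the next."""
    if prev_entropy <= 0.0:
        return 0.0  # nothing left to reduce
    return (prev_entropy - curr_entropy) / prev_entropy


def should_exit(prev_entropy: float, curr_entropy: float, threshold: float) -> bool:
    """Hypothetical exit rule: stop once a further block barely lowers
    the entropy of the intermediate prediction."""
    return fractional_reduction(prev_entropy, curr_entropy) < threshold


if __name__ == "__main__":
    # Gradient varies with the weight's magnitude, unlike a hard sign function.
    for w in (-1.5, -0.5, 0.0, 0.5, 1.5):
        print(w, approx_sign(w), approx_sign_grad(w))

    # Entropies of intermediate predictions from two consecutive blocks.
    h_prev = entropy([0.6, 0.4])
    h_curr = entropy([0.9, 0.1])
    print(should_exit(h_prev, h_curr, threshold=0.05))
```

Unlike the hard sign function, whose derivative is zero almost everywhere, the quadratic surrogate has a gradient that varies with the weight's magnitude. That is the property the abstract attributes to BAT.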
