Transformer-based Large Language Models (LLMs) achieve state-of-the-art results across a wide variety of applications. However, their enormous size and processing requirements make deployment on resource-constrained devices extremely difficult. Among the many efficiency techniques, model binarization and Early Exit (EE) are two effective solutions. However, binarization may lead to performance loss because reduced precision degrades gradient estimation and parameter updates. Moreover, existing early-exit mechanisms remain at an early stage of research. To address these issues, we propose the Binarized Early Exit Transformer (BEExformer), the first selective-learning transformer architecture to combine early exit with binarization for textual inference. It improves the binarization process through a differentiable second-order approximation to the impulse function, which enables gradient computation with respect to both the sign and the magnitude of the weights. In contrast to absolute-threshold-based EE, the proposed EE mechanism is based on the fractional reduction in entropy between intermediate transformer blocks, combined with a soft-routing loss. While binarization yields an 18.44× reduction in model size, early exit reduces inference FLOPs by 54.85% and even improves accuracy by 5.98% by resolving the "overthinking" problem inherent in deep networks. Moreover, the proposed BEExformer simplifies training by not requiring knowledge distillation from a full-precision LLM. Extensive evaluation on the GLUE benchmark and comparison with state-of-the-art (SOTA) methods demonstrate its Pareto-optimal performance-efficiency trade-off.
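The binarization gradient described above can be sketched as follows: the forward pass uses the hard sign function, while the backward pass replaces the sign's true derivative (an impulse) with the derivative of a piecewise second-order polynomial. This is a minimal NumPy sketch; the specific quadratic surrogate below (derivative `2 + 2w` on [-1, 0) and `2 - 2w` on [0, 1)) is an assumption in the style of ApproxSign-type estimators, not necessarily the paper's exact formulation.

```python
import numpy as np

def binarize_forward(w):
    """Forward pass: hard sign of the weights (values in {-1, +1})."""
    return np.where(w >= 0, 1.0, -1.0)

def impulse_surrogate_grad(w):
    """Backward pass: derivative of a piecewise second-order (quadratic)
    approximation to sign(w), i.e. a smooth, differentiable surrogate for
    the impulse that the true derivative of sign would be.

    Surrogate derivative (an assumed ApproxSign-style form):
        2 + 2w  for -1 <= w < 0
        2 - 2w  for  0 <= w < 1
        0       elsewhere (gradients are clipped outside [-1, 1])
    """
    g = np.zeros_like(w)
    neg = (w >= -1) & (w < 0)
    pos = (w >= 0) & (w < 1)
    g[neg] = 2 + 2 * w[neg]
    g[pos] = 2 - 2 * w[pos]
    return g

w = np.array([-1.5, -0.5, 0.0, 0.25, 2.0])
print(binarize_forward(w))        # [-1. -1.  1.  1.  1.]
print(impulse_surrogate_grad(w))  # [0.  1.  2.  1.5 0. ]
```

Because the surrogate gradient is nonzero and weight-dependent inside [-1, 1], the backward pass carries information about both the sign and the magnitude of each weight, unlike a plain straight-through estimator whose gradient is a constant within the clipping range.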
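The entropy-based exit criterion can likewise be sketched: instead of exiting when predictive entropy drops below an absolute threshold, the model exits when the *fractional* entropy reduction between consecutive blocks becomes small, i.e. when deeper blocks no longer meaningfully reduce uncertainty. The criterion and threshold below are illustrative assumptions; the paper's exact rule and its soft-routing loss may differ.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(probs, eps=1e-12):
    """Shannon entropy (nats) of a probability distribution."""
    return -np.sum(probs * np.log(probs + eps))

def should_exit(prev_logits, curr_logits, frac_threshold=0.1):
    """Exit early when the fractional drop in predictive entropy between
    two consecutive transformer blocks falls below `frac_threshold`,
    meaning the deeper block added little confidence. (Illustrative
    hypothetical criterion, not the paper's exact mechanism.)"""
    h_prev = entropy(softmax(prev_logits))
    h_curr = entropy(softmax(curr_logits))
    frac_drop = (h_prev - h_curr) / max(h_prev, 1e-12)
    return bool(frac_drop < frac_threshold)

# Identical predictions at both blocks: zero fractional drop -> exit.
print(should_exit(np.array([1.0, 1.0]), np.array([1.0, 1.0])))   # True
# Large confidence gain at the deeper block -> keep going.
print(should_exit(np.array([0.1, 0.0]), np.array([5.0, -5.0])))  # False
```

A relative criterion of this kind adapts to the absolute uncertainty level of each input, whereas a fixed entropy threshold can force easy inputs through unnecessary blocks and let hard inputs exit prematurely.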