Being able to trade off the latency, power, and accuracy of natural language understanding models is a desirable property of an efficient architecture. This paper proposes an efficient Transformer architecture that adaptively adjusts the inference computational cost to meet a desired inference latency speedup. In the fine-tuning phase, the proposed method detects less important hidden sequence elements (word-vectors) and eliminates them in each encoder layer using a proposed Attention Context Contribution (ACC) metric. After the fine-tuning phase, a novel offline-tuning property allows the inference latency of the model to be adjusted over a wide range of speedup settings without any further training. The proposed method is applied to the BERT-base and GPT-2 models for evaluation. Extensive experiments show that most of the word-vectors in higher Transformer layers contribute little to the subsequent layers; hence, they can be eliminated to improve the inference latency. Experimental results on a broad set of sentiment analysis, classification, text generation, and regression benchmarks, including GLUE, show that the method is effective across datasets with minimal impact on global context. Mathematically and experimentally, the proposed method speeds up inference of BERT-base and GPT-2 by up to 4.8 and 3.72 times, respectively, with less than 0.75% accuracy drop and passable perplexity on average. These results suggest that in Large Language Models (LLMs), although the complete network is necessary for training, it can be truncated during the fine-tuning phase.
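The word-vector elimination step described in the abstract can be illustrated with a minimal sketch. The abstract does not define the ACC metric, so the scoring rule below (average attention each token receives across heads and queries) is an assumption chosen for illustration, not the paper's formula. The names `token_scores`, `prune_word_vectors`, and `keep_ratio` are likewise hypothetical.

```python
import math

def token_scores(attention):
    """Score each token by the average attention it receives.

    attention[h][i][j] is the weight that query token i places on key
    token j in head h. This scoring rule is an illustrative assumption,
    not the ACC metric defined in the paper.
    """
    num_heads = len(attention)
    seq_len = len(attention[0])
    scores = [0.0] * seq_len
    for head in attention:
        for row in head:
            for j, weight in enumerate(row):
                scores[j] += weight
    return [s / (num_heads * seq_len) for s in scores]

def prune_word_vectors(hidden, attention, keep_ratio):
    """Keep the highest-scoring word-vectors, preserving their order.

    hidden is a list of word-vectors (one per token); keep_ratio is a
    hypothetical knob in (0, 1] standing in for the desired speedup.
    Returns the retained vectors and their original indices.
    """
    scores = token_scores(attention)
    num_keep = max(1, math.ceil(keep_ratio * len(hidden)))
    ranked = sorted(range(len(hidden)), key=lambda j: scores[j], reverse=True)
    kept = sorted(ranked[:num_keep])  # restore original token order
    return [hidden[j] for j in kept], kept

# Toy example: one head, four tokens, keep half of them.
attention = [[
    [0.7, 0.1, 0.1, 0.1],
    [0.6, 0.2, 0.1, 0.1],
    [0.5, 0.1, 0.3, 0.1],
    [0.6, 0.1, 0.2, 0.1],
]]
hidden = [[1.0], [2.0], [3.0], [4.0]]
pruned, kept = prune_word_vectors(hidden, attention, keep_ratio=0.5)
print(kept)  # [0, 2]
```

Because only the retained word-vectors are passed to the next layer, the sequence length seen by later layers shrinks, which is where the latency saving would come from. Changing `keep_ratio` per layer after fine-tuning mirrors, in spirit, the offline adjustment of speedup described in the abstract.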