Balancing the latency, power consumption, and accuracy of natural language understanding models is a desirable objective of an efficient architecture. This paper proposes an efficient Transformer architecture that adaptively adjusts the inference computational cost for a desired inference latency speedup. In the fine-tuning phase, the proposed method detects less important hidden sequence elements (word-vectors) and eliminates them in each encoder layer using a proposed Attention Context Contribution (ACC) metric. After the fine-tuning phase, owing to the novel offline-tuning property, the inference latency of the model can be adjusted over a wide range of speedup selections without any further training. The proposed method is applied to the BERT_base, GPT-2, and Flan-T5 models for evaluation. Extensive experiments show that most word-vectors in the higher Transformer layers contribute little to the subsequent layers; hence, they can be eliminated to improve the inference latency. Experimental results on sentiment analysis, classification, text generation, and regression benchmarks such as GLUE show that the method is effective across diverse datasets with minimal impact on the input's global context. The method was also evaluated under the instruction-tuning paradigm, and its performance was measured with different types of prompting. The proposed method is shown, both analytically and experimentally, to improve the inference latency of BERT_base and GPT-2 by up to 4.8 and 3.72 times, respectively, with less than a 0.75% accuracy drop and acceptable perplexity on average. The suggested approach posits that in Large Language Models (LLMs), although the complete network is necessary for training, it can be truncated during the fine-tuning phase.
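As a rough illustration of the idea (not the paper's exact formulation), eliminating low-contribution word-vectors between encoder layers can be sketched as scoring each sequence position by the attention mass it receives and keeping only the top-scoring positions; the scoring rule, function names, and keep_ratio below are assumptions for illustration only and do not reproduce the actual ACC metric.

```python
# Illustrative sketch only: drops low-contribution word-vectors between encoder
# layers. The contribution score and keep_ratio are assumptions; the paper's
# actual ACC metric may be defined differently.
import torch

def prune_word_vectors(hidden, attn_probs, keep_ratio=0.7):
    """hidden: (batch, seq_len, dim); attn_probs: (batch, heads, seq_len, seq_len)."""
    # Score each key position by the total attention it receives from all queries,
    # averaged over heads (a proxy for its contribution to the layer's context).
    contribution = attn_probs.mean(dim=1).sum(dim=1)          # (batch, seq_len)
    k = max(1, int(keep_ratio * hidden.size(1)))
    top_idx = contribution.topk(k, dim=-1).indices.sort(dim=-1).values  # keep order
    idx = top_idx.unsqueeze(-1).expand(-1, -1, hidden.size(-1))
    return hidden.gather(1, idx)                               # (batch, k, dim)

# Example: one sequence of 10 tokens, 12 heads, 768-dim hidden states.
h = torch.randn(1, 10, 768)
a = torch.softmax(torch.randn(1, 12, 10, 10), dim=-1)
print(prune_word_vectors(h, a, keep_ratio=0.5).shape)  # torch.Size([1, 5, 768])
```

Shrinking the sequence length in this way reduces the cost of every subsequent layer, which is why eliminating word-vectors in higher layers translates directly into inference speedup.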