As pretrained transformer language models continue to achieve state-of-the-art performance, the Natural Language Processing community has pursued advances in model compression and efficient attention mechanisms to address high computational requirements and limited input sequence length. Although both lines of work are active, their intersection has not been investigated. In this work, we evaluate model compression via knowledge distillation on efficient attention transformers. We report cost-performance trade-offs for compressing state-of-the-art efficient attention architectures, along with the performance gains relative to their full attention counterparts. Furthermore, we introduce a new long-context Named Entity Recognition (NER) dataset, GONERD, for training and testing NER models on long sequences. We find that distilled efficient attention transformers can preserve up to 98.6% of original model performance across short-context tasks (GLUE, SQuAD, CoNLL-2003), up to 94.6% across long-context question answering tasks (HotpotQA, TriviaQA), and up to 98.8% on long-context NER (GONERD), while decreasing inference times by up to 57.8%. For most models on most tasks, knowledge distillation is an effective method for obtaining high-performing efficient attention models at low cost.