Recent years have seen a phenomenal rise in the performance and applications of transformer neural networks. The family of transformer networks, including Bidirectional Encoder Representations from Transformers (BERT), Generative Pre-trained Transformer (GPT), and Vision Transformer (ViT), has shown its effectiveness across the Natural Language Processing (NLP) and Computer Vision (CV) domains. Transformer-based systems such as ChatGPT have affected the everyday lives of ordinary people. However, the quest for high predictive performance has led to an exponential increase in transformers' memory and compute footprint. In response, researchers have proposed techniques to optimize transformer inference at all levels of abstraction. This paper presents a comprehensive survey of techniques for optimizing the inference phase of transformer networks. At the algorithmic level, we survey knowledge distillation, pruning, quantization, neural architecture search, and lightweight network design. We further review hardware-level optimization techniques and the design of novel hardware accelerators for transformers. We summarize quantitative results on the number of parameters, FLOPs, and accuracy of several models and techniques to showcase the tradeoffs they make. We also outline future directions in this rapidly evolving field of research. We believe that this survey will be useful to both novice and seasoned researchers and will spark further research efforts in this field.