Recurrent neural networks are effective models for processing sequences. However, they struggle to learn long-term dependencies because of their inherent sequential nature. As a solution, Vaswani et al. introduced the Transformer, a model based solely on the attention mechanism that can relate any two positions of the input sequence, hence modelling arbitrarily long dependencies. The Transformer has improved the state of the art across numerous sequence modelling tasks. However, its effectiveness comes at the expense of quadratic computational and memory complexity with respect to the sequence length, hindering its adoption. Fortunately, the deep learning community has long been interested in improving model efficiency, leading to a plethora of solutions such as parameter sharing, pruning, mixed-precision training, and knowledge distillation. Recently, researchers have directly addressed the Transformer's limitation by designing lower-complexity alternatives such as the Longformer, Reformer, Linformer, and Performer. However, due to the wide range of solutions, it has become challenging for researchers and practitioners to determine which methods to apply in practice to meet the desired trade-off between capacity, computation, and memory. This survey addresses this issue by investigating popular approaches to make Transformers faster and lighter and by providing a comprehensive explanation of the methods' strengths, limitations, and underlying assumptions.