The Transformer has achieved remarkable success in language, image, and speech processing. Recently, various efficient attention architectures have been proposed to improve the Transformer's efficiency while largely preserving its efficacy, especially in modeling long sequences. A widely used benchmark for testing the long-range modeling capability of these efficient methods is Long Range Arena (LRA). However, LRA focuses only on standard bidirectional (or noncausal) self attention and completely ignores cross attention and unidirectional (or causal) attention, which are equally important to downstream applications. In this paper, we propose the Comprehensive Attention Benchmark (CAB) under a fine-grained attention taxonomy with four distinguishable attention patterns, namely, noncausal self, causal self, noncausal cross, and causal cross attention. CAB collects seven real-world tasks from different research areas to evaluate efficient attentions under the four attention patterns. Across these tasks, CAB validates efficient attentions in eight backbone networks to show their generalization across neural architectures. We conduct exhaustive experiments to benchmark the performance of nine widely used efficient attention architectures, designed with different philosophies, on CAB. The extensive experimental results also shed light on fundamental problems of efficient attention, such as efficiency length against vanilla attention, performance consistency across attention patterns, the benefit of attention mechanisms, and interpolation/extrapolation on long-context language modeling.
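To make the four-pattern taxonomy concrete, the following is a minimal NumPy sketch of vanilla softmax attention switched between the four patterns. It is an illustration under stated assumptions, not CAB's implementation: the names `attention`, `PATTERNS`, and `run_pattern` are hypothetical, there are no learned projections or multiple heads, and the causal rule used here (query position i may attend only to key positions j <= i, for both self and cross attention) is an assumed simplification rather than a definition taken from the benchmark.

```python
import numpy as np

def attention(q, k, v, causal=False):
    """Single-head softmax attention. q: (n, d); k and v: (m, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (n, m) similarity scores
    if causal:
        # Assumed causal rule: query i sees key j only when j <= i.
        visible = np.tril(np.ones(scores.shape, dtype=bool))
        scores = np.where(visible, scores, -np.inf)
    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# The four patterns differ along two axes: where the keys/values come
# from (self vs. cross) and whether future positions are masked.
PATTERNS = {
    "noncausal_self": {"cross": False, "causal": False},
    "causal_self": {"cross": False, "causal": True},
    "noncausal_cross": {"cross": True, "causal": False},
    "causal_cross": {"cross": True, "causal": True},
}

def run_pattern(name, target, source=None):
    """Apply one of the four patterns; `source` is required for cross attention."""
    cfg = PATTERNS[name]
    context = source if cfg["cross"] else target
    if context is None:
        raise ValueError(f"pattern {name!r} needs a source sequence")
    return attention(target, context, context, causal=cfg["causal"])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.normal(size=(4, 8))  # 4 query positions
    source = rng.normal(size=(6, 8))  # 6 source positions
    for name in PATTERNS:
        print(name, run_pattern(name, target, source).shape)
```

In this sketch, self attention draws queries, keys, and values from the same sequence, while cross attention draws keys and values from a separate source sequence. An efficient attention variant would replace the body of `attention` while keeping the same four calling patterns.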