Transformer-based models have achieved great success in a wide range of NLP, vision, and speech tasks. However, the core of the Transformer, the self-attention mechanism, has time and memory complexity that is quadratic in the sequence length, which hinders the application of Transformer-based models to long sequences. Many approaches have been proposed to mitigate this problem, such as sparse attention mechanisms, low-rank matrix approximations and scalable kernels, and token-mixing alternatives to self-attention. We propose a novel Pooling Network (PoNet) for token mixing in long sequences with linear complexity. We design multi-granularity pooling and pooling fusion to capture different levels of contextual information and combine their interactions with tokens. On the Long Range Arena benchmark, PoNet significantly outperforms the Transformer and achieves competitive accuracy, while being only slightly slower than the fastest model, FNet, across all sequence lengths measured on GPUs. We also conduct systematic studies of the transfer learning capability of PoNet and observe that PoNet achieves 95.7% of the accuracy of BERT on the GLUE benchmark, outperforming FNet by 4.5% relative. Comprehensive ablation analysis demonstrates the effectiveness of the designed multi-granularity pooling and pooling fusion for token mixing in long sequences, and the efficacy of the designed pre-training tasks in enabling PoNet to learn transferable contextualized language representations.
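To make the idea of pooling-based token mixing with linear complexity concrete, the following is a minimal sketch, not the PoNet implementation. The abstract does not specify the granularities or the fusion rule, so everything here is an assumption made for illustration: three granularities (a sequence-wide mean, a per-segment max, and a sliding-window max), a fusion rule that multiplies each token elementwise by the two coarser summaries and adds the local one, and the hypothetical names `mean_pool`, `max_pool`, `pooled_token_mixing`, `segment_size`, and `window`.

```python
def mean_pool(vectors):
    """Per-dimension mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def max_pool(vectors):
    """Per-dimension max of a non-empty list of equal-length vectors."""
    return [max(col) for col in zip(*vectors)]


def pooled_token_mixing(tokens, segment_size=2, window=3):
    """Mix tokens using pooled context at three assumed granularities.

    tokens: list of n vectors (lists of floats), all of the same dimension.
    Returns a list of n mixed vectors. Cost is linear in n for fixed
    segment_size and window, since no token-to-token score matrix is built.
    """
    n = len(tokens)

    # Coarsest granularity: one summary vector for the whole sequence.
    global_vec = mean_pool(tokens)

    # Middle granularity: one summary vector per fixed-size segment.
    segment_vecs = [
        max_pool(tokens[start:start + segment_size])
        for start in range(0, n, segment_size)
    ]

    half = window // 2
    mixed = []
    for i, tok in enumerate(tokens):
        # Finest granularity: max over a small window centred on token i.
        local_vec = max_pool(tokens[max(0, i - half):i + half + 1])
        seg_vec = segment_vecs[i // segment_size]

        # Assumed fusion rule (illustrative only): the token gates the
        # global and segment summaries, and the local summary is added.
        mixed.append([
            t * g + t * s + l
            for t, g, s, l in zip(tok, global_vec, seg_vec, local_vec)
        ])
    return mixed


if __name__ == "__main__":
    example = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    for row in pooled_token_mixing(example):
        print(row)
```

Each pooling pass visits every token a bounded number of times, so the total work grows linearly with sequence length, in contrast to self-attention, which compares every pair of tokens.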