Sequence modeling has important applications in natural language processing and computer vision. Recently, transformer-based models have shown strong performance on various sequence modeling tasks. They rely on attention to capture pairwise token relations and on position embeddings to inject positional information. Despite this performance, transformers scale poorly to long input sequences, mainly because of the quadratic space-time complexity of attention. To overcome this inefficiency, we propose to model sequences with a relative-position-encoded Toeplitz matrix and to use a Toeplitz matrix-vector product trick to reduce the space-time complexity of sequence modeling to log-linear. We also propose a lightweight sub-network, the relative position encoder, which generates relative position coefficients with a fixed parameter budget and thereby enables the proposed Toeplitz neural network to handle varying sequence lengths. In addition, despite being trained on 512-token sequences, our model can extrapolate to input sequences of up to 14K tokens at inference with consistent performance. Extensive experiments on autoregressive and bidirectional language modeling, image modeling, and the challenging Long-Range Arena benchmark show that our method outperforms its competitors on most downstream tasks while being significantly faster. The code is available at https://github.com/OpenNLPLab/Tnn.
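To illustrate the Toeplitz matrix-vector product trick mentioned above, the following is a minimal sketch of the standard circulant-embedding approach: an n × n Toeplitz matrix is embedded in a larger circulant matrix, whose product with a vector can be computed with fast Fourier transforms in O(n log n) time. This is not the authors' implementation. The function names (`fft`, `ifft`, `toeplitz_matvec`, `toeplitz_matvec_naive`) and the input layout (`first_col[k]` holds the coefficient for relative position k, `first_row[k]` the coefficient for relative position -k) are assumptions made for illustration, and the coefficients are passed in directly rather than generated by a relative position encoder.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


def ifft(a):
    """Inverse FFT: the conjugate-sign transform scaled by 1/n."""
    n = len(a)
    return [v / n for v in fft(a, invert=True)]


def toeplitz_matvec(first_col, first_row, x):
    """Compute T @ x in O(n log n), where T[i][j] = t_{i-j}.

    first_col[k] = t_k and first_row[k] = t_{-k} (hypothetical layout);
    first_row[0] is ignored because it duplicates first_col[0].
    """
    n = len(x)
    m = 1
    while m < 2 * n:  # power of two large enough to avoid wrap-around overlap
        m *= 2
    # First column of the m x m circulant matrix that contains T in its
    # top-left n x n block.
    c = [0j] * m
    for k in range(n):
        c[k] = first_col[k]
    for k in range(1, n):
        c[m - k] = first_row[k]
    x_pad = list(x) + [0.0] * (m - n)
    # A circulant product is a circular convolution, i.e. a pointwise
    # product in the Fourier domain.
    spectrum = [a * b for a, b in zip(fft(c), fft(x_pad))]
    return [v.real for v in ifft(spectrum)[:n]]


def toeplitz_matvec_naive(first_col, first_row, x):
    """O(n^2) reference implementation used to check the fast version."""
    n = len(x)

    def t(d):
        return first_col[d] if d >= 0 else first_row[-d]

    return [sum(t(i - j) * x[j] for j in range(n)) for i in range(n)]
```

Comparing `toeplitz_matvec` against `toeplitz_matvec_naive` on a small input is a quick way to check the embedding. The hand-written FFT keeps the sketch dependency-free; a practical implementation would more likely call an optimized FFT library and operate on real-valued inputs.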