Linear attention mechanisms have gained prominence in causal language models because of their linear computational complexity and speed. However, the decay mechanism inherent in linear attention poses challenges for multi-dimensional sequence modeling tasks such as image processing and multi-modal learning. In these settings, sequential scanning must be repeated several times over multi-dimensional data to establish a global receptive field, which is inefficient. This paper traces the inefficiency to the multiplicative linear recurrence and proposes an efficient alternative, an additive linear recurrence, which can handle multi-dimensional data in a single scan. Based on the new recurrence, we develop LightNet, an efficient framework for multi-dimensional sequential modeling. We also present two new multi-dimensional linear relative positional encoding methods, MD-TPE and MD-LRPE, which enhance the model's ability to discern positional information in multi-dimensional scenarios. Empirical evaluations on image classification, image generation, bidirectional language modeling, and autoregressive language modeling demonstrate the efficacy of LightNet and its potential as a versatile and efficient solution for multi-dimensional sequential modeling.
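The contrast between the two recurrence types can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the formulation used in LightNet: it assumes a fixed scalar decay and a plain outer-product state update, and the function names `multiplicative_scan` and `additive_scan` are hypothetical. It shows only that a decayed (multiplicative) state depends on the order in which a flattened 2D grid is scanned, whereas an accumulated (additive) state does not, which is why a single pass can suffice in the additive case.

```python
import numpy as np

def multiplicative_scan(keys, values, decay):
    """Illustrative recurrence with decay: h_t = decay * h_{t-1} + k_t v_t^T."""
    state = np.zeros((keys.shape[1], values.shape[1]))
    for k, v in zip(keys, values):
        state = decay * state + np.outer(k, v)
    return state

def additive_scan(keys, values):
    """Illustrative recurrence without decay: h_t = h_{t-1} + k_t v_t^T."""
    state = np.zeros((keys.shape[1], values.shape[1]))
    for k, v in zip(keys, values):
        state = state + np.outer(k, v)
    return state

# A 4x4 "image" flattened to 16 tokens with 4-dimensional keys and values.
rng = np.random.default_rng(0)
height, width, dim = 4, 4, 4
keys = rng.normal(size=(height * width, dim))
values = rng.normal(size=(height * width, dim))

# Two scan orders over the same grid: row-major (default) and column-major.
column_major = np.arange(height * width).reshape(height, width).T.ravel()

# The additive state is a plain sum, so the scan order does not matter.
additive_same = np.allclose(
    additive_scan(keys, values),
    additive_scan(keys[column_major], values[column_major]),
)

# The decayed state weights tokens by their distance from the end of the
# scan, so a different scan order yields a different final state.
multiplicative_same = np.allclose(
    multiplicative_scan(keys, values, decay=0.9),
    multiplicative_scan(keys[column_major], values[column_major], decay=0.9),
)

print("additive state order-independent:", additive_same)
print("multiplicative state order-independent:", multiplicative_same)
```

In this toy setting, the order dependence of the decayed state is what would motivate scanning a multi-dimensional input along several directions; the additive state avoids this dependence at the cost of carrying no positional information by itself, which is consistent with the abstract pairing the new recurrence with dedicated positional encodings.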