To mitigate the computational complexity of the self-attention mechanism on long sequences, linear attention utilizes computation tricks to achieve linear complexity, while state space models (SSMs) popularize the favorable practice of using non-data-dependent memory patterns, i.e., emphasizing the near and neglecting the distant, to process sequences. Recent studies have shown the promise of combining the two as one. However, the efficiency of linear attention remains only theoretical in a causal setting, and SSMs require carefully designed constraints to operate effectively on specific data. Therefore, to unveil the true power of the hybrid design, the following two issues need to be addressed: (1) a hardware-efficient implementation of linear attention and (2) the stabilization of SSMs. To this end, we leverage the ideas of tiling and hierarchy to propose CHELA (short-long Convolutions with Hardware-Efficient Linear Attention), which replaces SSMs with short-long convolutions and implements linear attention in a divide-and-conquer manner. This approach enjoys the global abstraction and data-dependent selection of stable SSMs and linear attention while maintaining truly linear complexity. Our comprehensive experiments on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
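To illustrate the divide-and-conquer idea behind hardware-efficient causal linear attention, the following is a minimal NumPy sketch (a generic chunked formulation, not the paper's actual CHELA kernel). Linear attention replaces the softmax with a positive feature map φ and reassociates (QKᵀ)V as Q(KᵀV); in the causal case, the sequence is processed in chunks, carrying a running (d × d) state KᵀV across chunks, so the cost is O(n·c·d) for chunk size c rather than O(n²·d). The feature map `elu(x)+1` and the chunk size are illustrative choices.

```python
import numpy as np

def chunked_causal_linear_attention(Q, K, V, chunk=16, eps=1e-6):
    # Positive feature map phi(x) = elu(x) + 1 (an illustrative choice).
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))
    Qf, Kf = phi(Q), phi(K)
    n, d = Qf.shape
    out = np.zeros_like(V)
    S = np.zeros((d, V.shape[-1]))  # running sum of k_j v_j^T over past chunks
    z = np.zeros(d)                 # running sum of k_j over past chunks
    for s in range(0, n, chunk):
        e = min(s + chunk, n)
        q, k, v = Qf[s:e], Kf[s:e], V[s:e]
        # Inter-chunk part: contribution of all previous positions via the state.
        num = q @ S
        den = q @ z
        # Intra-chunk part: causal attention within the chunk (lower-triangular mask).
        A = np.tril(q @ k.T)
        num += A @ v
        den += A.sum(axis=-1)
        out[s:e] = num / (den[:, None] + eps)
        # Fold this chunk into the running state.
        S += k.T @ v
        z += k.sum(axis=0)
    return out

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 32, 8))
out = chunked_causal_linear_attention(Q, K, V, chunk=8)
```

Because the chunked recurrence is exact, the result is independent of the chunk size (up to floating-point reassociation); the chunk size only trades off how much work is done in the quadratic intra-chunk branch, which is the tiling knob that hardware-efficient implementations tune.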