The quadratic complexity of the attention mechanism is one of the biggest hurdles for processing long sequences with Transformers. Current methods, which rely on sparse representations or stateful recurrence, sacrifice token-to-token interactions and therefore compromise performance. This paper introduces TaylorShift, a novel reformulation of the Taylor softmax that enables computing full token-to-token interactions in linear time and space. We analytically determine the crossover points at which TaylorShift becomes more efficient than traditional attention, and these align closely with empirical measurements. Specifically, our findings demonstrate that TaylorShift improves memory efficiency for sequences as short as 800 tokens and accelerates inference for inputs of approximately 1700 tokens and beyond. For shorter sequences, TaylorShift scales comparably with vanilla attention. Furthermore, a classification benchmark across five tasks involving long sequences reveals no degradation in accuracy when employing Transformers equipped with TaylorShift. For reproducibility, our code is available at https://github.com/tobna/TaylorShift.
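To illustrate how a Taylor-approximated softmax can yield full token-to-token interactions in linear time, the following is a minimal NumPy sketch, not the implementation from the repository above. It assumes a second-order expansion exp(x) ≈ 1 + x + x²/2 and uses the identity (q·k)² = ⟨q⊗q, k⊗k⟩ to reorder the matrix products so that no N × N matrix is formed. The function names are hypothetical, and the scaling and normalization used in TaylorShift itself may differ.

```python
import numpy as np


def taylor_attention_direct(Q, K, V):
    """Quadratic-cost reference: materializes the full N x N score matrix."""
    S = Q @ K.T                          # (N, N) pairwise scores
    A = 1.0 + S + 0.5 * S**2             # 2nd-order Taylor expansion of exp; always > 0
    return (A @ V) / A.sum(axis=1, keepdims=True)


def taylor_attention_linear(Q, K, V):
    """Same result, but cost grows linearly in N (and as d^2 in the head dimension)."""
    N, d = Q.shape
    # Append a column of ones so the numerator and the row normalizer
    # are accumulated in a single pass.
    V1 = np.concatenate([V, np.ones((N, 1))], axis=1)       # (N, d_v + 1)

    # Zeroth-order term: sum_j v_j, identical for every query.
    zeroth = V1.sum(axis=0, keepdims=True)                   # (1, d_v + 1)

    # First-order term: Q (K^T V) instead of (Q K^T) V.
    first = Q @ (K.T @ V1)                                   # (N, d_v + 1)

    # Second-order term: (q.k)^2 = <q (x) q, k (x) k>, so the same
    # reordering applies to the row-wise outer products.
    Q2 = np.einsum("ni,nj->nij", Q, Q).reshape(N, d * d)     # (N, d^2)
    K2 = np.einsum("ni,nj->nij", K, K).reshape(N, d * d)     # (N, d^2)
    second = 0.5 * Q2 @ (K2.T @ V1)                          # (N, d_v + 1)

    Y = zeroth + first + second
    return Y[:, :-1] / Y[:, -1:]                             # divide by the normalizer column
```

Both functions return the same output up to floating-point error. The direct version needs O(N²) memory for the score matrix, whereas the linear version keeps only (d² × (d_v + 1))-sized intermediates. That trade-off is why such a reformulation only pays off beyond a certain sequence length.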