The self-attention mechanism, at the heart of the Transformer model, is able to effectively model pairwise interactions between tokens. However, numerous recent works have shown that it is unable to perform basic tasks involving detecting triples of correlated tokens, or compositional tasks where multiple input tokens need to be referenced to generate a result. Some higher-dimensional alternatives to self-attention have been proposed to address this, including higher-order attention and Strassen attention, which can perform some of these polyadic tasks in exchange for slower, superquadratic running times. In this work, we define a vast class of generalizations of self-attention, which we call poly-attention mechanisms. Our mechanisms can incorporate arbitrary higher-order (tensor) computations as well as arbitrary relationship structures between the input tokens, and they include the aforementioned alternatives as special cases. We then systematically study their computational complexity and representational strength, including giving new algorithms and matching complexity-theoretic lower bounds on the time complexity of computing the attention matrix exactly as well as approximately, and tightly determining which polyadic tasks they can each perform. Our results give interesting trade-offs between different desiderata for these mechanisms, including a tight relationship between how expressive a mechanism is, and how large the coefficients in the model may be so that the mechanism can be approximated in almost-linear time. Notably, we give a new attention mechanism which can be computed exactly in quadratic time, and which can perform function composition for any fixed number of functions. Prior mechanisms, even for just composing two functions, could only be computed in superquadratic time, and our new lower bounds show that faster algorithms for them are not possible.
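To make the quadratic-vs-superquadratic contrast concrete, here is a minimal NumPy sketch of standard pairwise self-attention next to a third-order (triadic) variant. This is an illustrative assumption, not the paper's definition of poly-attention: the trilinear score and the pair-product value aggregation are one simple way higher-order attention is often formulated, and all function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    # Pairwise scores s[i, j] = <q_i, k_j>: O(n^2 d) time.
    n, d = Q.shape
    S = Q @ K.T / np.sqrt(d)
    return softmax(S, axis=-1) @ V

def third_order_attention(Q, K1, K2, V1, V2):
    # Triadic scores s[i, j, k] = sum_t q_i[t] * k1_j[t] * k2_k[t]:
    # the attention tensor has n^3 entries, so computing it exactly
    # takes O(n^3 d) time -- the superquadratic cost mentioned above.
    n, d = Q.shape
    S = np.einsum('id,jd,kd->ijk', Q, K1, K2) / np.sqrt(d)
    # Normalize over all (j, k) pairs for each query i.
    A = softmax(S.reshape(n, n * n), axis=-1).reshape(n, n, n)
    # Aggregate a value for each pair: elementwise product v1_j * v2_k.
    return np.einsum('ijk,jd,kd->id', A, V1, V2)
```

The pairwise mechanism can only score token pairs, whereas the triadic one scores every triple (i, j, k) directly; the price is the n^3-entry attention tensor, which is the bottleneck the lower bounds in this work address.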