The transformer is the most popular neural architecture for language modeling. The cornerstone of the transformer is its global attention mechanism, which lets the model aggregate information from all preceding tokens before generating the next token. One common variant of attention is called local attention, which restricts each token to aggregating information from a bounded window of predecessors, reducing the quadratic cost of global attention to linear. Although this restriction is usually motivated by efficiency, it has also been found to improve model quality, a phenomenon that has so far lacked a satisfactory explanation. We provide a formal account of this phenomenon in terms of recognizer expressivity. It has been shown that fixed-precision transformers with global attention correspond to a fragment of linear temporal logic containing a single past operator. We additionally prove that adding local attention introduces a second temporal operator, strictly enlarging the class of recognizable regular languages. Moreover, global and local attention are expressively complementary: neither subsumes the other, and combining them yields the richest fragment. Experiments on formal language recognition and natural language modeling corroborate the theory, showing that hybrid global-local transformers outperform their global-only counterparts.
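The contrast between global and local attention described above can be illustrated with attention masks. The following is a minimal sketch, not the construction used in this work: the function names `global_mask`, `local_mask`, and `attended_pairs` are hypothetical, and it assumes that the local window includes the current token, which is one common convention among several.

```python
def global_mask(n):
    """Causal global mask: position i may attend to every position j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def local_mask(n, window):
    """Causal sliding-window mask (assumed convention: the window includes
    the current token, so position i attends to i - window + 1 .. i)."""
    return [[i - window < j <= i for j in range(n)] for i in range(n)]


def attended_pairs(mask):
    """Count the (query, key) pairs the mask allows, a proxy for attention cost."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    n, window = 8, 3
    # Global attention: n * (n + 1) / 2 pairs, quadratic in n.
    print(attended_pairs(global_mask(n)))
    # Local attention: at most window * n pairs, linear in n for a fixed window.
    print(attended_pairs(local_mask(n, window)))
```

With a fixed window, the number of allowed pairs grows linearly in the sequence length, whereas the global mask grows quadratically. A hybrid model would, under this sketch, use the global mask in some layers or heads and the local mask in others.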