Attention, specifically scaled dot-product attention, has proven effective for natural language, but it does not have a mechanism for handling hierarchical patterns of arbitrary nesting depth, which limits its ability to recognize certain syntactic structures. To address this shortcoming, we propose stack attention: an attention operator that incorporates stacks, inspired by their theoretical connections to context-free languages (CFLs). We show that stack attention is analogous to standard attention, but with a latent model of syntax that requires no syntactic supervision. We propose two variants: one related to deterministic pushdown automata (PDAs) and one based on nondeterministic PDAs, which allows transformers to recognize arbitrary CFLs. We show that transformers with stack attention are very effective at learning CFLs that standard transformers struggle on, achieving strong results on a CFL with theoretically maximal parsing difficulty. We also show that stack attention is more effective at natural language modeling under a constrained parameter budget, and we include results on machine translation.