Attention, specifically scaled dot-product attention, has proven effective for natural language, but it does not have a mechanism for handling hierarchical patterns of arbitrary nesting depth, which limits its ability to recognize certain syntactic structures. To address this shortcoming, we propose stack attention: an attention operator that incorporates stacks, inspired by their theoretical connections to context-free languages (CFLs). We show that stack attention is analogous to standard attention, but with a latent model of syntax that requires no syntactic supervision. We propose two variants: one related to deterministic pushdown automata (PDAs) and one based on nondeterministic PDAs, which allows transformers to recognize arbitrary CFLs. We show that transformers with stack attention are very effective at learning CFLs that standard transformers struggle with, achieving strong results on a CFL with theoretically maximal parsing difficulty. We also show that stack attention is more effective at natural language modeling under a constrained parameter budget, and we include results on machine translation.