Despite their impressive performance in NLP, self-attention networks were recently proved to be limited for processing formal languages with hierarchical structure, such as $\mathsf{Dyck}_k$, the language consisting of well-nested parentheses of $k$ types. This suggested that natural language can be approximated well with models that are too weak for formal languages, or that the role of hierarchy and recursion in natural language might be limited. We qualify this implication by proving that self-attention networks can process $\mathsf{Dyck}_{k, D}$, the subset of $\mathsf{Dyck}_{k}$ with depth bounded by $D$, which arguably better captures the bounded hierarchical structure of natural language. Specifically, we construct a hard-attention network with $D+1$ layers and $O(\log k)$ memory size (per token per layer) that recognizes $\mathsf{Dyck}_{k, D}$, and a soft-attention network with two layers and $O(\log k)$ memory size that generates $\mathsf{Dyck}_{k, D}$. Experiments show that self-attention networks trained on $\mathsf{Dyck}_{k, D}$ generalize to longer inputs with near-perfect accuracy, and also verify the theoretical memory advantage of self-attention networks over recurrent networks.
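To make the definition of $\mathsf{Dyck}_{k, D}$ concrete, the following is a minimal sketch of a membership check, assuming each of the $k$ bracket types is given as an (open, close) pair of single-character tokens. It is an ordinary stack-based recognizer, not the hard-attention or soft-attention construction described above; the names `in_dyck_k_d`, `pairs`, and `max_depth` are hypothetical and chosen only for illustration.

```python
def in_dyck_k_d(tokens, pairs, max_depth):
    """Return True iff `tokens` is well-nested over `pairs` with nesting depth <= max_depth.

    tokens:    iterable of bracket tokens, e.g. the string "([])"
    pairs:     list of (open, close) tuples, one per bracket type (k = len(pairs))
    max_depth: the depth bound D
    """
    closer_of = dict(pairs)              # maps each open bracket to its close bracket
    closers = set(closer_of.values())
    stack = []                           # close brackets we still expect, innermost last

    for tok in tokens:
        if tok in closer_of:
            stack.append(closer_of[tok])
            if len(stack) > max_depth:   # nesting exceeds the bound D
                return False
        elif tok in closers:
            # A close bracket must match the innermost unclosed open bracket.
            if not stack or stack.pop() != tok:
                return False
        else:
            return False                 # token outside the bracket alphabet

    return not stack                     # every open bracket must be closed


PAIRS = [("(", ")"), ("[", "]")]         # k = 2 bracket types

print(in_dyck_k_d("([])", PAIRS, 2))     # well-nested, depth 2
print(in_dyck_k_d("([)]", PAIRS, 2))     # brackets cross, so not well-nested
print(in_dyck_k_d("(([]))", PAIRS, 2))   # well-nested but depth 3 exceeds D = 2
```

Because the stack never holds more than `max_depth` entries and each entry is one of $k$ bracket types, this check needs $O(D \log k)$ bits of state in total, which gives a point of comparison for the per-token, per-layer memory bounds stated above.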