Transformers excel empirically on tasks whose inputs are well-formed according to some grammar, such as natural language and code. However, it remains unclear how they can process grammatical syntax. In fact, under standard complexity conjectures, standard transformers cannot recognize context-free languages (CFLs), a canonical formalism for describing syntax, or even regular languages, a subclass of CFLs. Past work proves that $\mathcal{O}(\log(n))$ looping layers (with respect to input length $n$) allow transformers to recognize regular languages, but the question of context-free recognition has remained open. In this work, we show that looped transformers with $\mathcal{O}(\log(n))$ looping layers and $\mathcal{O}(n^6)$ padding tokens can recognize all CFLs. However, training and inference with $\mathcal{O}(n^6)$ padding tokens is potentially impractical. Fortunately, we show that for natural subclasses, such as unambiguous CFLs, the recognition problem for transformers becomes more tractable, requiring only $\mathcal{O}(n^3)$ padding. We empirically validate our results and show that looping helps on a language that provably requires logarithmic depth. Overall, our results shed light on the intricacy of CFL recognition by transformers: while general recognition may require an intractable amount of padding, natural constraints such as unambiguity yield efficient recognition algorithms.
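For context, the classic sequential baseline for CFL recognition is the CYK dynamic program, which runs in $\mathcal{O}(n^3)$ time for grammars in Chomsky normal form. The sketch below is a minimal illustrative implementation of that baseline, not the paper's looped-transformer construction; the dictionaries `unary` and `binary` and the $a^n b^n$ example grammar are hypothetical names chosen for the example.

```python
from itertools import product

def cyk_recognize(word, start, unary, binary):
    """Return True iff `word` is derivable from `start` (CYK, O(n^3) in |word|).

    unary:  dict terminal -> set of nonterminals A with rule A -> terminal
    binary: dict (B, C)   -> set of nonterminals A with rule A -> B C
    """
    n = len(word)
    if n == 0:
        return False  # CNF grammars derive only nonempty strings
    # span[(i, j)] = set of nonterminals deriving word[i:j]
    span = {(i, i + 1): set(unary.get(a, ())) for i, a in enumerate(word)}
    for length in range(2, n + 1):          # increasing span length
        for i in range(n - length + 1):
            j = i + length
            cell = set()
            for k in range(i + 1, j):       # split point of the span
                for B, C in product(span[(i, k)], span[(k, j)]):
                    cell |= binary.get((B, C), set())
            span[(i, j)] = cell
    return start in span[(0, n)]

# Hypothetical CNF grammar for { a^n b^n : n >= 1 }:
#   S -> A T | A B,  T -> S B,  A -> a,  B -> b
unary = {"a": {"A"}, "b": {"B"}}
binary = {("A", "T"): {"S"}, ("A", "B"): {"S"}, ("S", "B"): {"T"}}

print(cyk_recognize("aabb", "S", unary, binary))  # True
print(cyk_recognize("aab", "S", unary, binary))   # False
```

The paper's $\mathcal{O}(n^3)$ padding bound for unambiguous CFLs is suggestive of this cubic table size, while the general $\mathcal{O}(n^6)$ bound reflects the extra cost of simulating such a computation in logarithmic depth; the code above is only the standard sequential reference point.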