Transformers are ubiquitous models in the natural language processing (NLP) community and have shown impressive empirical successes in the past few years. However, little is understood about how they reason and the limits of their computational capabilities. These models do not process data sequentially, and yet they outperform sequential neural models such as RNNs. Recent work has shown that these models can compactly simulate the sequential reasoning abilities of deterministic finite automata (DFAs). This leads to the following question: can transformers simulate the reasoning of more complex finite-state machines? In this work, we show that transformers can simulate weighted finite automata (WFAs), a class of models which subsumes DFAs, as well as weighted tree automata (WTAs), a generalization of weighted automata to tree-structured inputs. We prove these claims formally and provide upper bounds on the sizes of the transformer models needed as a function of the number of states of the target automata. Empirically, we perform synthetic experiments showing that transformers are able to learn these compact solutions via standard gradient-based training.
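To make the claim that WFAs subsume DFAs concrete, the following is a minimal sketch using the standard textbook definition of a WFA: an initial weight vector, one transition matrix per symbol, and a final weight vector, with the value of a word given by the product of these along the word. This definition and all names below (`wfa_value`, `PARITY`, `COUNT_A`) are illustrative assumptions, not taken from the paper, and the sketch says nothing about the transformer constructions themselves.

```python
# Sketch of a weighted finite automaton (WFA), assuming the standard definition:
# f(w) = alpha^T * A[w_1] * ... * A[w_n] * beta.
# All names here are hypothetical and chosen for illustration only.

def wfa_value(alpha, transitions, beta, word):
    """Evaluate a WFA on `word` by propagating a row vector of state weights."""
    state = list(alpha)
    n = len(state)
    for symbol in word:
        A = transitions[symbol]
        # Row-vector times matrix: new_state[j] = sum_i state[i] * A[i][j]
        state = [sum(state[i] * A[i][j] for i in range(n)) for j in range(n)]
    # Combine the final state weights with the final weight vector.
    return sum(s * b for s, b in zip(state, beta))

IDENTITY = [[1, 0], [0, 1]]

# A DFA written as a WFA: one-hot initial vector, 0/1 transition matrices,
# and a 0/1 final vector marking accepting states.
# It accepts words over {a, b} that contain an even number of 'a'.
PARITY = {
    "alpha": [1, 0],
    "transitions": {"a": [[0, 1], [1, 0]], "b": IDENTITY},
    "beta": [1, 0],
}

# A WFA that is not a DFA: its value is the number of 'a' symbols in the word.
COUNT_A = {
    "alpha": [1, 0],
    "transitions": {"a": [[1, 1], [0, 1]], "b": IDENTITY},
    "beta": [0, 1],
}

def run(wfa, word):
    return wfa_value(wfa["alpha"], wfa["transitions"], wfa["beta"], word)
```

With these definitions, `run(PARITY, "abab")` evaluates to 1 (accepted) and `run(COUNT_A, "abab")` evaluates to 2, which shows how the same matrix-product form covers both a Boolean acceptor and a real-valued function on strings.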