Previous research has explored the computational expressivity of Transformer models in simulating Boolean circuits or Turing machines. However, the learnability of these simulators from observational data has remained an open question. Our study addresses this gap by providing the first polynomial-time learnability results (specifically strong, agnostic PAC learning) for single-layer Transformers with linear attention. We show that linear attention may be viewed as a linear predictor in a suitably defined RKHS. As a consequence, the problem of learning any linear Transformer may be converted into the problem of learning an ordinary linear predictor in an expanded feature space, and any such predictor may be converted back into a multi-headed linear Transformer. Moving to generalization, we show how to efficiently identify training datasets for which every empirical risk minimizer is equivalent (up to trivial symmetries) to the linear Transformer that generated the data, thereby guaranteeing that the learned model will correctly generalize across all inputs. Finally, we provide examples of computations expressible via linear attention and therefore polynomial-time learnable, including associative memories, finite automata, and a class of Universal Turing Machines (UTMs) with polynomially bounded computation histories. We empirically validate our theoretical findings on three tasks: learning random linear attention networks, learning key--value associations, and learning to execute finite automata. Our findings bridge a critical gap between the theoretical expressivity and the learnability of Transformers, and show that flexible and general models of computation are efficiently learnable.
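To make the RKHS viewpoint concrete, the following is a minimal sketch of the linearization for a single head, under assumed notation ($x_i \in \mathbb{R}^d$ for token embeddings, $W_{KQ}$ for a merged key--query matrix, $W_V$ for the value matrix); the paper's exact parametrization may differ:

\[
[f(X)]_i \;=\; \sum_{j} \bigl(x_i^{\top} W_{KQ}\, x_j\bigr)\, W_V x_j
\;=\; \Theta\, \phi(X)_i,
\qquad
\phi(X)_i \;=\; \sum_{j} x_i \otimes x_j \otimes x_j \;\in\; \mathbb{R}^{d^3},
\]

where row $c$ of $\Theta$ is the flattening of $W_{KQ} \otimes (W_V)_{c,:}$. An unconstrained $\Theta \in \mathbb{R}^{d_{\mathrm{out}} \times d^{3}}$ is then an ordinary linear predictor over the cubic features $\phi(X)_i$, and any such $\Theta$ can be decomposed back into a sum of head-structured terms, i.e., a multi-headed linear attention layer.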