Many neural network architectures have been shown to be Turing complete and can thus implement arbitrary algorithms. However, Transformers are unique in that they can implement gradient-based learning algorithms \emph{under simple parameter configurations}. A line of recent work shows that linear Transformers naturally learn to implement gradient descent (GD) when trained on a linear-regression in-context learning task. But the linearity assumption (in either the Transformer architecture or the learning task) is far from realistic settings, where non-linear activations crucially enable Transformers to learn complicated non-linear functions. In this paper, we provide theoretical and empirical evidence that non-linear Transformers can, and \emph{in fact do}, learn to implement learning algorithms that learn non-linear functions in context. Our results apply to a broad class of combinations of non-linear architectures and non-linear in-context learning tasks. Interestingly, we show that the optimal choice of non-linear activation depends in a natural way on the non-linearity of the learning task.
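To make the linear-case claim concrete, the following is a minimal sketch of the well-known correspondence between one step of gradient descent on in-context linear regression and one softmax-free (linear) attention readout. It illustrates only the linear setting that the abstract attributes to prior work, not the non-linear construction of this paper. The function names, the zero initialization, the learning rate, and the data sizes are all assumptions chosen for illustration.

```python
import numpy as np


def gd_one_step_prediction(X, y, x_query, lr):
    """Predict at x_query after one GD step on the in-context least-squares loss.

    Loss: L(w) = 1/(2n) * sum_i (w . x_i - y_i)^2, starting from w = 0 (assumed).
    """
    n, d = X.shape
    w0 = np.zeros(d)
    grad = X.T @ (X @ w0 - y) / n  # gradient of L at w0
    w1 = w0 - lr * grad            # single gradient step
    return w1 @ x_query


def linear_attention_prediction(X, y, x_query, lr):
    """Predict at x_query with one linear attention readout (no softmax).

    Keys are the context inputs x_i, values are the labels y_i, and the query
    is x_query. The scalar lr / n plays the role of the learned projections
    (an illustrative simplification).
    """
    n = X.shape[0]
    scores = X @ x_query           # unnormalized key-query inner products
    return (lr / n) * (scores @ y)


# Hypothetical in-context task: noiseless linear regression.
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 5))
w_true = rng.normal(size=5)
y = X @ w_true
x_query = rng.normal(size=5)

pred_gd = gd_one_step_prediction(X, y, x_query, lr=0.1)
pred_attn = linear_attention_prediction(X, y, x_query, lr=0.1)
print(pred_gd, pred_attn)
```

Under these assumptions the two predictions agree up to floating-point error, because both reduce to (lr / n) * sum_i y_i (x_i . x_query). Replacing the inner product with a non-linear similarity breaks this identity, which is the regime the paper studies.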