Many neural network architectures have been shown to be Turing complete and can thus implement arbitrary algorithms. However, Transformers are unique in that they can implement gradient-based learning algorithms \emph{under simple parameter configurations}. A line of recent work shows that linear Transformers naturally learn to implement gradient descent (GD) when trained on in-context linear regression tasks. But the linearity assumption (in either the Transformer architecture or the learning task) is far from realistic settings, where non-linear activations crucially enable Transformers to learn complicated non-linear functions. In this paper, we provide theoretical and empirical evidence that non-linear Transformers can, and \emph{in fact do}, learn to implement learning algorithms that learn non-linear functions in context. Our results apply to a broad class of combinations of non-linear architectures and non-linear in-context learning tasks. Interestingly, we show that the optimal choice of non-linear activation depends in a natural way on the non-linearity of the learning task.
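As background for the linear case mentioned above, the following minimal sketch illustrates the standard identity behind "linear attention implements GD": one gradient step from zero weights on a least-squares loss gives the same prediction as a softmax-free attention readout with keys $x_i$, values $y_i$, and the query as the test input. This is not the construction used in this paper; the function names, the step size `lr`, and the synthetic data are all assumptions made for illustration.

```python
import random


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def gd_one_step_prediction(xs, ys, x_query, lr):
    """Predict at x_query after one GD step from w = 0 on
    L(w) = (1 / 2n) * sum_i (w . x_i - y_i)^2."""
    n, d = len(xs), len(x_query)
    # The gradient at w = 0 is -(1/n) * sum_i y_i x_i,
    # so the updated weights are w_1 = (lr / n) * sum_i y_i x_i.
    w = [lr / n * sum(y * x[j] for x, y in zip(xs, ys)) for j in range(d)]
    return dot(w, x_query)


def linear_attention_prediction(xs, ys, x_query, lr):
    """Softmax-free attention readout: keys x_i, values y_i, query x_query.
    The scaling lr / n is an illustrative choice, not a learned parameter."""
    n = len(xs)
    scores = [dot(x, x_query) for x in xs]  # unnormalised key-query scores
    return lr / n * sum(s * y for s, y in zip(scores, ys))


# Synthetic in-context linear regression prompt (hypothetical data).
rng = random.Random(0)
d, n = 3, 8
w_true = [rng.gauss(0, 1) for _ in range(d)]
xs = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
ys = [dot(w_true, x) for x in xs]
x_query = [rng.gauss(0, 1) for _ in range(d)]

pred_gd = gd_one_step_prediction(xs, ys, x_query, lr=0.1)
pred_attn = linear_attention_prediction(xs, ys, x_query, lr=0.1)
print(abs(pred_gd - pred_attn) < 1e-9)
```

The two predictions agree up to floating-point error because both compute $\frac{\eta}{n}\sum_i y_i \langle x_i, x_q \rangle$. The sketch covers only the linear setting. The non-linear architectures and tasks studied in this paper are not modelled here.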