Transformers have the capacity to act as supervised learning algorithms: by properly encoding a set of labeled training ("in-context") examples and an unlabeled test example into an input sequence of vectors of the same dimension, the forward pass of the transformer can produce predictions for that unlabeled test example. A line of recent work has shown that when linear transformers are pre-trained on random instances for linear regression tasks, these trained transformers make predictions using an algorithm similar to that of ordinary least squares. In this work, we investigate the behavior of linear transformers trained on random linear classification tasks. Via an analysis of the implicit regularization of gradient descent, we characterize how many pre-training tasks and in-context examples are needed for the trained transformer to generalize well at test-time. We further show that in some settings, these trained transformers can exhibit "benign overfitting in-context": when in-context examples are corrupted by label flipping noise, the transformer memorizes all of its in-context examples (including those with noisy labels) yet still generalizes near-optimally for clean test examples.
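The encoding scheme described above can be sketched minimally: each labeled in-context example is packed into a token `[x; y]`, the test example into a token `[x_test; 0]`, and a single softmax-free (linear) attention layer reads a prediction off the label coordinate. Everything below is a hypothetical illustration, not the paper's construction; with the attention matrix `W` set to the identity, the output reduces to the one-step-gradient-descent predictor `(1/n) * sum_i (x_test . x_i) y_i`, the kind of least-squares-like rule the abstract alludes to.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 20  # feature dimension, number of in-context examples (arbitrary)

# Hypothetical random linear classification task
w_star = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = np.sign(X @ w_star)
x_test = rng.standard_normal(d)

# Encode each in-context example as a (d+1)-dim token [x; y];
# the query token carries the test input with a 0 in the label slot.
tokens = np.hstack([X, y[:, None]])       # shape (n, d+1)
query = np.concatenate([x_test, [0.0]])   # shape (d+1,)

# One layer of linear (softmax-free) self-attention, with the key/query/value
# matrices collapsed into a single matrix W. W = identity here for the sketch;
# in the paper's setting W would be learned by pre-training on random tasks.
W = np.eye(d + 1)
scores = query @ W @ tokens.T             # unnormalized linear attention scores
out = scores @ tokens / n                 # average of value tokens, weighted by scores
y_hat = np.sign(out[-1])                  # prediction read off the label coordinate
```

With `W = I`, the label coordinate of `out` is exactly `(1/n) * sum_i (x_test . x_i) y_i`, i.e. one gradient step on the least-squares loss from a zero initialization; pre-training would tune `W` toward a better-conditioned variant of this rule.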