Large pretrained language models show a surprising capacity for in-context learning (ICL): given a few demonstration input-label pairs, they can predict the label of an unseen input without any parameter updates. Despite ICL's strong empirical performance, its working mechanism remains an open question. In this paper, we interpret language models as meta-optimizers and ICL as implicit finetuning. Theoretically, we show that Transformer attention has a dual form of gradient descent. Building on this, we understand ICL as follows: GPT first produces meta-gradients from the demonstration examples, and these meta-gradients are then applied to the original GPT to build an ICL model. We comprehensively compare the behaviors of ICL and explicit finetuning on real tasks to provide empirical evidence for this view. Experimental results show that ICL behaves similarly to explicit finetuning from multiple perspectives. Inspired by the dual form between Transformer attention and gradient descent, we design a momentum-based attention by analogy with gradient descent with momentum. Its improved performance over vanilla attention supports our understanding from another perspective and, more importantly, shows the potential of using this understanding to guide future model design. The code is available at \url{https://aka.ms/icl}.
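To make the dual form concrete, the following is a minimal sketch and not the released code. It assumes a single linear layer and linear (softmax-free) attention, which is a simplification; the names `W0`, `X`, `E`, and `q` and the random data are hypothetical. Under these assumptions, accumulating outer-product gradient updates and then applying the updated weights to a query gives the same output as applying the initial weights and adding an attention term, with the error signals as values and the training inputs as keys.

```python
import random

random.seed(0)
d_in, d_out, n = 4, 3, 5  # hypothetical sizes

def rand_vec(k):
    return [random.gauss(0.0, 1.0) for _ in range(k)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

W0 = [rand_vec(d_in) for _ in range(d_out)]  # initial weights
X = [rand_vec(d_in) for _ in range(n)]       # training inputs x_i (act as keys)
E = [rand_vec(d_out) for _ in range(n)]      # error signals e_i (act as values)
q = rand_vec(d_in)                           # test input (acts as the query)

# Gradient-descent view: W = W0 + sum_i outer(e_i, x_i), then apply W to q.
W = [row[:] for row in W0]
for e, x in zip(E, X):
    for r in range(d_out):
        for c in range(d_in):
            W[r][c] += e[r] * x[c]
out_gd = matvec(W, q)

# Attention view: W0 q + sum_i e_i * (x_i . q), i.e. unnormalized linear attention.
scores = [dot(x, q) for x in X]
out_attn = matvec(W0, q)
for e, s in zip(E, scores):
    for r in range(d_out):
        out_attn[r] += s * e[r]

# The two views should agree up to floating-point rounding.
max_gap = max(abs(a - b) for a, b in zip(out_gd, out_attn))
dual_form_holds = max_gap < 1e-9
```

In this simplified setting the demonstration tokens play the role of training examples, and the attention term plays the role of the weight update applied to the query. Standard softmax attention adds a normalization that this sketch leaves out.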