Pre-trained large language models based on Transformers have demonstrated remarkable in-context learning (ICL) abilities: with just a few demonstration examples, they can perform new tasks without any parameter updates. However, the mechanism underlying ICL remains an open question. In this paper, we explore the ICL process in Transformers through the lens of representation learning. First, leveraging kernel methods, we derive a dual model for a single softmax attention layer. The ICL inference process of the attention layer aligns with the training procedure of this dual model, producing token-representation predictions that are equivalent to the dual model's test outputs. We analyze the training process of this dual model from a representation learning standpoint and further derive a generalization error bound in terms of the number of demonstration tokens. We then extend our theoretical conclusions to more complicated scenarios, including a full Transformer layer and multiple attention layers. Furthermore, drawing inspiration from existing representation learning methods, especially contrastive learning, we propose potential modifications to the attention layer. Finally, we design experiments to support our findings.
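To make the setting concrete, the following is a minimal sketch (not the paper's derivation) of the ICL forward pass of one softmax attention layer: a query token attends over demonstration tokens to produce its updated representation. The weight matrices `W_q`, `W_k`, `W_v` and all token values are random placeholders for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 4        # token dimension (illustrative)
n_demo = 8   # number of demonstration tokens (illustrative)

# Hypothetical query/key/value projections of one attention layer.
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
demos = rng.standard_normal((n_demo, d))  # demonstration tokens
query = rng.standard_normal((1, d))       # query token

# Scaled dot-product softmax attention of the query over the demonstrations.
scores = (query @ W_q) @ (demos @ W_k).T / np.sqrt(d)
attn = softmax(scores, axis=-1)           # attention weights, rows sum to 1
out = attn @ (demos @ W_v)                # predicted token representation
```

Under the paper's kernel-based view, this inference step corresponds to the test-time output of a dual model trained on the demonstration tokens; the sketch above only shows the attention computation itself.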