Large language models based on transformers have achieved great empirical successes. However, as they are deployed more widely, there is a growing need to better understand their internal mechanisms in order to make them more reliable. These models appear to store vast amounts of knowledge from their training data, and to adapt quickly to new information provided in their context or prompt. We study how transformers balance these two types of knowledge by considering a synthetic setup where tokens are generated from either global or context-specific bigram distributions. By a careful empirical analysis of the training process on a simplified two-layer transformer, we illustrate the fast learning of global bigrams and the slower development of an "induction head" mechanism for the in-context bigrams. We highlight the role of weight matrices as associative memories, provide theoretical insights on how gradients enable their learning during training, and study the role of data-distributional properties.
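To make the idea of weight matrices acting as associative memories concrete, here is a minimal sketch. The abstract does not specify the construction, so everything here is an assumption made for illustration: tokens get random, nearly orthogonal high-dimensional embeddings, a matrix stores input-output pairs as a sum of outer products, and recall takes the output embedding with the largest score. The names `random_embeddings`, `build_memory`, and `recall`, and the sizes used, are hypothetical.

```python
import numpy as np


def random_embeddings(n, d, rng):
    # Assumption: Gaussian vectors scaled by 1/sqrt(d) are close to unit norm
    # and nearly orthogonal to each other when d is large.
    return rng.standard_normal((n, d)) / np.sqrt(d)


def build_memory(U, V, pairs):
    # Store each association (input i -> output j) as the outer product
    # V[j] U[i]^T and sum them into a single weight matrix.
    W = np.zeros((V.shape[1], U.shape[1]))
    for i, j in pairs:
        W += np.outer(V[j], U[i])
    return W


def recall(W, U, V, i):
    # Score every candidate output j by V[j]^T W U[i] and return the best one.
    scores = V @ (W @ U[i])
    return int(np.argmax(scores))


rng = np.random.default_rng(0)
n_tokens, dim = 20, 256  # hypothetical sizes chosen for the illustration
U = random_embeddings(n_tokens, dim, rng)  # input embeddings
V = random_embeddings(n_tokens, dim, rng)  # output embeddings

# A toy "bigram" table: token i is followed by token (i + 1) mod n_tokens.
pairs = [(i, (i + 1) % n_tokens) for i in range(n_tokens)]
W = build_memory(U, V, pairs)

accuracy = float(np.mean([recall(W, U, V, i) == j for i, j in pairs]))
print(f"recall accuracy: {accuracy:.2f}")
```

Because the embeddings are only approximately orthogonal, each stored pair adds a little noise to the scores of the others. With the dimension much larger than the number of stored pairs, as in this sketch, the correct association should still dominate.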