Large language models have demonstrated impressive in-context learning (ICL) capabilities. However, it remains unclear how the underlying transformers accomplish this, especially in more complex scenarios. Toward this end, several recent works have studied how transformers learn fixed-order Markov chains (FOMCs) in context, yet natural languages are more suitably modeled by variable-order Markov chains (VOMCs), i.e., context trees (CTs). In this work, we study the ICL of VOMCs by viewing language modeling as a form of data compression, focusing on small alphabets and low-order VOMCs. This perspective allows us to leverage mature compression algorithms as baselines, such as the context-tree weighting (CTW) and prediction by partial matching (PPM) algorithms, the former of which is Bayesian optimal for a class of CTW priors. We empirically observe several phenomena: 1) transformers can indeed learn to compress VOMC sequences in context, whereas PPM suffers significantly; 2) the performance of transformers is not very sensitive to the number of layers, and even a two-layer transformer can learn in context quite well; and 3) transformers trained and tested on non-CTW priors can significantly outperform the CTW algorithm. To explain these phenomena, we analyze the attention maps of the transformers and extract two mechanisms, based on which we provide two transformer constructions: 1) a construction with $D+2$ layers that can mimic the CTW algorithm accurately for CTs of maximum order $D$; and 2) a two-layer transformer that utilizes the feed-forward network for probability blending. One distinction from the FOMC setting is that a counting mechanism appears to play an important role. We implement these synthetic transformer layers and show that the resulting hybrid transformers can match the ICL performance of fully trained transformers; more interestingly, some of them perform even better despite much smaller parameter sets.
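To make the CTW baseline concrete, the following is a minimal sketch of the standard CTW recursion for a binary alphabet: each node of the context tree scores its data with the Krichevsky-Trofimov (KT) estimator, and internal nodes mix that estimate with the product of their two children's weighted probabilities. The function names, the naive recount-per-node implementation, and the choice of conditioning on a fixed `past` string are all illustrative assumptions, not the paper's implementation.

```python
from itertools import product
from math import exp, isclose, lgamma

def kt_prob(a, b):
    """Krichevsky-Trofimov block probability of a zeros and b ones:
    P_e(a, b) = Gamma(a + 1/2) * Gamma(b + 1/2) / (pi * Gamma(a + b + 1)),
    computed via log-gamma for numerical stability."""
    return exp(lgamma(a + 0.5) + lgamma(b + 0.5)
               - 2 * lgamma(0.5) - lgamma(a + b + 1))

def _counts(x, past, ctx):
    """Count zeros/ones in x whose immediately preceding symbols equal ctx;
    past supplies the context for the first symbols of x."""
    s = past + x
    k, a, b = len(ctx), 0, 0
    for i in range(len(past), len(s)):
        if s[i - k:i] == ctx:
            if s[i] == '0':
                a += 1
            else:
                b += 1
    return a, b

def ctw_prob(x, past, depth, ctx=''):
    """Weighted probability P_w of the block x at the node labeled ctx:
    leaves (contexts of maximal depth) use the raw KT estimate; internal
    nodes mix it 50/50 with the product of the two child nodes, whose
    contexts are extended by one older symbol."""
    a, b = _counts(x, past, ctx)
    pe = kt_prob(a, b)
    if len(ctx) == depth:
        return pe
    return 0.5 * pe + 0.5 * (ctw_prob(x, past, depth, '0' + ctx)
                             * ctw_prob(x, past, depth, '1' + ctx))
```

For example, `ctw_prob('0101', past='000', depth=1)` mixes an order-0 and an order-1 model of the block `0101`; the code length achieved by CTW on that block is `-log2` of this probability. The naive recount at every node keeps the sketch short but is far from the sequential, linear-time form used in practice.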