Attention-based transformers have been remarkably successful at modeling generative processes across various domains and modalities. In this paper, we study the behavior of transformers on data drawn from \kth Markov processes, where the conditional distribution of the next symbol in a sequence depends on the previous $k$ symbols observed. Empirically, we observe a surprising phenomenon that contradicts previous findings: when trained for sufficiently long, a transformer with a fixed depth and $1$ head per layer is able to achieve low test loss on sequences drawn from \kth Markov sources, even as $k$ grows. Furthermore, this low test loss is achieved through the transformer's ability to represent and learn the in-context conditional empirical distribution. On the theoretical side, our main result is that a transformer with a single head and three layers can represent the in-context conditional empirical distribution for \kth Markov sources, consistent with our empirical observations. Along the way, we prove that \textit{attention-only} transformers with $O(\log_2(k))$ layers can represent the in-context conditional empirical distribution by composing induction heads to track the previous $k$ symbols in the sequence. By characterizing their behavior on Markov sources, these results deepen our understanding of the mechanisms by which transformers learn to capture context.
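For concreteness, here is a minimal sketch of the central quantity above; the estimator symbol $\widehat{P}_n$ and the indexing convention are notational assumptions of ours, not fixed by the text. For a sequence $x_1, \dots, x_n$ over a finite alphabet, the in-context conditional empirical distribution assigns to a candidate next symbol $s$ the fraction of past occurrences of the current length-$k$ context that were followed by $s$:
\[
\widehat{P}_n\left(s \,\middle|\, x_{n-k+1}^{n}\right) = \frac{\sum_{i=k}^{n-1} \mathbb{1}\{x_{i-k+1}^{i} = x_{n-k+1}^{n},\; x_{i+1} = s\}}{\sum_{i=k}^{n-1} \mathbb{1}\{x_{i-k+1}^{i} = x_{n-k+1}^{n}\}},
\]
where $x_a^b$ denotes the substring $(x_a, \dots, x_b)$ and $\mathbb{1}\{\cdot\}$ is the indicator function. In other words, it is the maximum-likelihood estimate of the \kth Markov transition kernel computed from the context itself.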