In recent years, transformer-based models have revolutionized deep learning, particularly in sequence modeling. To better understand this phenomenon, there is growing interest in using Markov input processes to study transformers. However, our current understanding remains limited, with many fundamental questions about how transformers learn Markov chains still unanswered. In this paper, we address this gap by focusing on first-order Markov chains and single-layer transformers, providing a comprehensive characterization of the learning dynamics in this setting. Specifically, we prove that transformer parameters trained on the next-token prediction loss can converge to either global or local minima, depending on the initialization and the properties of the Markovian data, and we characterize the precise conditions under which each occurs. To the best of our knowledge, this is the first result of its kind highlighting the role of initialization. We further corroborate these theoretical findings with empirical evidence. Building on these insights, we provide guidelines for initializing transformer parameters and demonstrate their effectiveness. Finally, we outline several open problems in this arena. Code is available at: https://github.com/Bond1995/Markov.
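To make the data setting concrete, the following is a minimal NumPy sketch (not the paper's implementation) of a binary first-order Markov source and the next-token prediction loss it induces; the switching probabilities `p` and `q` and the function names are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch (assumed setup, not the authors' code): a binary
# first-order Markov chain with switching probabilities
# p = P(x_{t+1} = 1 | x_t = 0) and q = P(x_{t+1} = 0 | x_t = 1).
def sample_markov_chain(length, p=0.2, q=0.3, seed=None):
    rng = np.random.default_rng(seed)
    x = np.empty(length, dtype=np.int64)
    x[0] = rng.integers(0, 2)  # uniform initial state
    for t in range(1, length):
        flip = rng.random() < (p if x[t - 1] == 0 else q)
        x[t] = 1 - x[t - 1] if flip else x[t - 1]
    return x

# Next-token prediction loss: average cross-entropy of a model's
# predicted P(x_{t+1} = 1 | x_1..x_t) against the realized next symbols.
def next_token_loss(probs_next_is_one, x):
    targets = x[1:]
    p1 = np.clip(probs_next_is_one[:-1], 1e-12, 1 - 1e-12)
    return -np.mean(targets * np.log(p1) + (1 - targets) * np.log(1 - p1))

x = sample_markov_chain(10_000, p=0.2, q=0.3, seed=0)
# For a first-order chain, the Bayes-optimal one-step predictor depends
# only on the previous symbol:
probs = np.where(x == 0, 0.2, 1 - 0.3)
print(next_token_loss(probs, x))  # approaches the entropy rate of the chain
```

A trained transformer's loss can be compared against this Bayes-optimal baseline: reaching it corresponds to the global minimum discussed above, while stalling at a higher, prediction-by-marginals loss corresponds to a local minimum.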