Large language models (LLMs) have proven to be remarkably efficient, both across a wide range of natural language processing tasks and well beyond them. However, a comprehensive theoretical analysis of the origins of their impressive performance remains elusive. In this paper, we approach this challenging task by drawing an equivalence between generic autoregressive language models with a vocabulary of size $T$ and a context window of size $K$, and Markov chains defined on a finite state space of size $\mathcal{O}(T^K)$. We derive several surprising findings related to the existence of a stationary distribution for the Markov chains that capture the inference power of LLMs, their speed of convergence to it, and the influence of the temperature on the latter. We then prove pre-training and in-context generalization bounds and show how the drawn equivalence allows us to enrich their interpretation. Finally, we illustrate our theoretical guarantees with experiments on several recent LLMs to highlight how they capture the behavior observed in practice.
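As a quick sanity check on the $\mathcal{O}(T^K)$ claim, consider the illustrative assumption (a sketch, not a restatement of the paper's exact construction) that each state of the induced Markov chain is a non-empty token sequence of length at most $K$ over the vocabulary. The number of such sequences is
$$
\sum_{k=1}^{K} T^{k} \;=\; \frac{T\,\bigl(T^{K}-1\bigr)}{T-1} \;=\; \mathcal{O}\!\left(T^{K}\right),
$$
so, for example, $T = 2$ tokens and $K = 3$ yield $2 + 4 + 8 = 14$ states.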