Large language models display remarkable capabilities in logical and mathematical reasoning, allowing them to solve complex tasks. Interestingly, these abilities emerge in networks trained on the simple task of next-token prediction. In this work, we present a theoretical framework for studying auto-regressive next-token predictors. We demonstrate that even simple models such as linear next-token predictors, trained on Chain-of-Thought (CoT) data, can approximate any function efficiently computed by a Turing machine. We introduce a new complexity measure -- length complexity -- which measures the number of intermediate tokens in a CoT sequence required to approximate some target function, and analyze the interplay between length complexity and other notions of complexity. Finally, we show experimentally that simple next-token predictors, such as linear networks and shallow Multi-Layer Perceptrons (MLPs), display non-trivial performance on text generation and arithmetic tasks. Our results demonstrate that the power of language models can be attributed, to a great extent, to the auto-regressive next-token training scheme, and not necessarily to a particular choice of architecture.
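As a minimal illustration of the claim about linear predictors and intermediate tokens, consider XOR of two bits. XOR is not a linear threshold function of its two inputs, but it becomes computable when a linear threshold predictor is allowed to emit intermediate tokens and read them back. The sketch below is an assumption-laden toy, not the construction or training procedure used in this work: the weights are set by hand rather than learned, each step has its own weight vector, and the names `NAND_STEPS`, `linear_step`, and `generate` are hypothetical. Each step is a NAND of two earlier tokens, which a single linear threshold can implement.

```python
from itertools import product

# Hypothetical hand-set schedule (not learned, not taken from the paper).
# Each entry names the two context positions that the step's linear
# threshold unit looks at; positions 0 and 1 hold the input bits a and b.
NAND_STEPS = [
    (0, 1),  # c   = NAND(a, b)
    (0, 2),  # d   = NAND(a, c)
    (1, 2),  # e   = NAND(b, c)
    (3, 4),  # out = NAND(d, e), which equals XOR(a, b)
]


def linear_step(context, i, j):
    """Predict the next token as a linear threshold over the whole context.

    The weight vector is -1 on positions i and j and 0 elsewhere, with
    bias 1.5, so the unit fires unless both selected tokens are 1 (NAND).
    """
    weights = [0.0] * len(context)
    weights[i] = -1.0
    weights[j] = -1.0
    score = sum(w * x for w, x in zip(weights, context)) + 1.5
    return 1 if score > 0 else 0


def generate(a, b):
    """Auto-regressively extend [a, b] with three intermediate tokens and an answer."""
    context = [a, b]
    for i, j in NAND_STEPS:
        context.append(linear_step(context, i, j))
    return context


for a, b in product([0, 1], repeat=2):
    tokens = generate(a, b)
    print(a, b, "chain:", tokens[2:-1], "answer:", tokens[-1])
```

In this toy, three intermediate tokens precede the answer; the number of such tokens is the quantity that length complexity is meant to capture, under the simplifying assumptions stated above.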