Large language models display remarkable capabilities in logical and mathematical reasoning, allowing them to solve complex tasks. Interestingly, these abilities emerge in networks trained on the simple task of next-token prediction. In this work, we present a theoretical framework for studying auto-regressive next-token predictors. We demonstrate that even simple models such as linear next-token predictors, trained on Chain-of-Thought (CoT) data, can approximate any function efficiently computed by a Turing machine. We introduce a new complexity measure -- length complexity -- which measures the number of intermediate tokens in a CoT sequence required to approximate some target function, and analyze the interplay between length complexity and other notions of complexity. Finally, we show experimentally that simple next-token predictors, such as linear networks and shallow Multi-Layer Perceptrons (MLPs), display non-trivial performance on text generation and arithmetic tasks. Our results demonstrate that the power of today's LLMs can be attributed, to a great extent, to the auto-regressive next-token training scheme, and not necessarily to a particular choice of architecture.
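To make the training scheme concrete, below is a minimal sketch (not the paper's code) of the setup the abstract describes: a purely linear next-token predictor, trained auto-regressively with cross-entropy on sequences that contain intermediate chain-of-thought tokens, then run as a greedy auto-regressive generator. The toy task, vocabulary layout, running-sum CoT format, and all hyperparameters here are illustrative assumptions: the model must emit the running sums of a bit string as CoT tokens before the final parity answer, so that each next-token step is simple enough for a linear model to fit.

```python
# Hypothetical toy illustration of a linear next-token predictor trained on
# CoT data; task, token layout, and hyperparameters are assumptions, not the
# paper's construction.
import numpy as np

rng = np.random.default_rng(0)

N_BITS = 3
SEP, EOS = N_BITS + 1, N_BITS + 2   # digit tokens 0..N_BITS, then SEP, EOS
VOCAB = N_BITS + 3
CTX = 8                             # fixed window of preceding tokens

def make_sequence():
    """Input bits, SEP, running sums as CoT tokens, parity answer, EOS."""
    bits = rng.integers(0, 2, N_BITS)
    sums = np.cumsum(bits)          # intermediate "reasoning" tokens
    return list(bits) + [SEP] + list(sums) + [int(sums[-1] % 2), EOS]

def features(seq, t):
    """Concatenated one-hots of the last CTX tokens before position t,
    right-aligned so the most recent token always fills the same slot."""
    x = np.zeros(CTX * VOCAB)
    for i, tok in enumerate(reversed(seq[max(0, t - CTX):t])):
        x[i * VOCAB + tok] = 1.0
    return x

# Purely linear model: logits = W @ x, trained by SGD on the standard
# auto-regressive cross-entropy loss at every position of every sequence.
W = np.zeros((VOCAB, CTX * VOCAB))
lr = 0.3
for _ in range(20000):
    seq = make_sequence()
    for t in range(1, len(seq)):
        x, y = features(seq, t), seq[t]
        logits = W @ x
        p = np.exp(logits - logits.max()); p /= p.sum()
        p[y] -= 1.0                 # gradient of cross-entropy w.r.t. logits
        W -= lr * np.outer(p, x)

def generate(prompt, max_len=16):
    """Greedy auto-regressive decoding until EOS."""
    seq = list(prompt)
    while seq[-1] != EOS and len(seq) < max_len:
        seq.append(int(np.argmax(W @ features(seq, len(seq)))))
    return seq

print(generate([1, 0, 1, SEP]))     # expect CoT 1, 1, 2, then answer 0, EOS
```

In runs of this kind the linear model typically fits each CoT step, since every intermediate token is a simple function of a few tokens at fixed relative positions; without the intermediate sum tokens, mapping bits directly to their parity is not realizable by a single linear layer. The number of intermediate tokens the sequence needs is exactly what the abstract's length complexity measure counts.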