Large language models like GPT-4 exhibit emergent capabilities across general-purpose tasks, such as basic arithmetic, when trained on extensive text data, even though these tasks are not explicitly encoded by the unsupervised next-token prediction objective. This study investigates how small transformers, trained from random initialization, can efficiently learn arithmetic operations such as addition and multiplication, as well as elementary functions like square root, using the next-token prediction objective. We first demonstrate that conventionally formatted training data is not the most effective for learning arithmetic, and that simple formatting changes can significantly improve accuracy. Doing so leads to sharp phase transitions as a function of training data scale, which, in some cases, can be explained through connections to low-rank matrix completion. Building on prior work, we then train on chain-of-thought style data that includes intermediate step results. Even in the complete absence of pretraining, this approach simultaneously and significantly improves accuracy, sample complexity, and convergence speed. We also study the interplay between arithmetic and text data during training and examine the effects of few-shot prompting, pretraining, and model scale. Additionally, we discuss the challenges of length generalization. Our work highlights the importance of high-quality, instructive data that takes into account the particular characteristics of the next-token prediction objective for rapidly eliciting arithmetic capabilities.
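To make the idea of data formatting concrete, the following is a minimal sketch of three ways a single addition example could be serialized as text for next-token prediction. The abstract does not specify the exact formats, so everything here is an assumption made for illustration: the function names `format_plain`, `format_reversed`, and `format_with_steps`, the reversed-answer variant, and the wording of the intermediate steps are all hypothetical. The reversed variant is included because a left-to-right predictor then emits the least significant digit first, which matches the order in which carries propagate; the step-by-step variant is one possible shape for chain-of-thought style data with intermediate results.

```python
def format_plain(a: int, b: int) -> str:
    """Conventional format: operands, then the answer written most significant digit first."""
    return f"{a}+{b}={a + b}"


def format_reversed(a: int, b: int) -> str:
    """Hypothetical formatting change: the answer digits are written least significant first."""
    return f"{a}+{b}={str(a + b)[::-1]}"


def format_with_steps(a: int, b: int) -> str:
    """Hypothetical chain-of-thought style sample: one line per digit position, with carries."""
    lines = [f"{a}+{b}"]
    digits = []  # answer digits, least significant first
    carry = 0
    x, y = a, b
    while x > 0 or y > 0:
        da, db = x % 10, y % 10
        total = da + db + carry
        digit, new_carry = total % 10, total // 10
        lines.append(f"{da}+{db}+{carry}={total} digit {digit} carry {new_carry}")
        digits.append(str(digit))
        carry = new_carry
        x //= 10
        y //= 10
    if carry:
        digits.append(str(carry))
    if not digits:  # both operands were zero
        digits = ["0"]
    lines.append("answer " + "".join(reversed(digits)))
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_plain(128, 367))
    print(format_reversed(128, 367))
    print(format_with_steps(128, 367))
```

In this sketch, all three functions encode the same fact (128 + 367 = 495); only the token sequence the model is asked to predict differs. The step-by-step sample is longer per example, which is the kind of trade-off against sample complexity that the abstract alludes to.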