Chain-of-thought (CoT) reasoning has become a widely used mechanism for eliciting multi-step reasoning in large language models by generating intermediate reasoning steps at inference time. Yet how generalization scales with CoT depth remains poorly understood. To address this question, we study a theoretically solvable model of CoT for in-context weight prediction in linear regression, where test-time reasoning is represented as an iterative refinement of the weight-parameter estimate. Using tools from random matrix theory under high-dimensional asymptotics, we derive an exact formula for the generalization error as a function of reasoning depth, amount of pretraining data, and context length. Our analysis reveals a sharp phase transition separating regimes of exponential improvement, polynomial improvement, saturation, and overthinking, and characterizes how the optimal reasoning depth scales. We further show that deeper reasoning is most effective when pretraining and in-context information are sufficiently rich, whereas limited pretraining or context makes longer reasoning prone to error amplification or saturation. We also validate these predictions through experiments on fully learned linear attention and softmax attention models. Our results provide a unified theoretical account of how test-time CoT depth affects generalization.
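To make "iterative refinement of the weight-parameter estimate" concrete, the following is a minimal illustrative sketch and not the model analyzed in this work. It assumes that each reasoning step is one plain gradient-style update on the in-context least-squares loss, with a hand-picked step size. The function names `refine_weights` and `demo`, the step-size rule, and all default parameters are hypothetical choices made only for illustration.

```python
import random


def refine_weights(X, y, depth):
    """Return the weight estimates after 0, 1, ..., depth refinement steps.

    Assumption: one "reasoning step" is w <- w + step * X^T (y - X w) / n.
    """
    n, d = len(X), len(X[0])
    # Assumed step size: 1 / trace(X^T X / n). The trace is at least the
    # largest eigenvalue, so this step size keeps the iteration stable.
    trace = sum(x_ij * x_ij for row in X for x_ij in row) / n
    step = 1.0 / trace
    w = [0.0] * d
    history = [list(w)]
    for _ in range(depth):
        # Residuals of the current estimate on the in-context examples.
        residuals = [y_i - sum(a * b for a, b in zip(row, w))
                     for row, y_i in zip(X, y)]
        # Update direction X^T residuals / n (negative least-squares gradient).
        update = [sum(r * row[j] for r, row in zip(residuals, X)) / n
                  for j in range(d)]
        w = [w_j + step * u_j for w_j, u_j in zip(w, update)]
        history.append(list(w))
    return history


def demo(seed=0, d=8, n=64, depth=100, noise=0.1):
    """Squared error to the true weights at every depth, on synthetic data."""
    rng = random.Random(seed)
    w_star = [rng.gauss(0, 1) / d ** 0.5 for _ in range(d)]
    X = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    y = [sum(a * b for a, b in zip(row, w_star)) + noise * rng.gauss(0, 1)
         for row in X]
    history = refine_weights(X, y, depth)
    return [sum((a - b) ** 2 for a, b in zip(w, w_star)) for w in history]
```

Calling `demo()` returns one squared error per depth from 0 to `depth`, so the list can be inspected or plotted to see how the estimate changes as more refinement steps are taken. This toy setting has ample context (`n` much larger than `d`) and no learned component, so it does not reproduce the saturation or overthinking regimes described above.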