Compositional Generalization from Learned Skills via CoT Training: A Theoretical and Structural Analysis for Reasoning

Chain-of-Thought (CoT) training has markedly advanced the reasoning capabilities of large language models (LLMs), yet the mechanisms by which it improves generalization remain poorly understood. In this work, we show that compositional generalization is fundamental: during CoT training, models learn to systematically combine simpler learned skills to solve novel, more complex problems. We formalize this process through a theoretical and structural analysis. 1) Theoretically, information-theoretic generalization bounds based on distributional divergence can be decomposed into in-distribution (ID) and out-of-distribution (OOD) components. Specifically, non-CoT models fail on OOD tasks because of unseen compositional patterns, whereas CoT-trained models achieve strong generalization by composing previously learned skills. In addition, controlled experiments and real-world validation confirm that CoT training accelerates convergence and enhances generalization from ID to both ID and OOD scenarios, while maintaining robust performance under tolerable levels of noise. 2) Structurally, CoT training internalizes reasoning into a two-stage compositional circuit, where the number of stages corresponds to the number of explicit reasoning steps seen during training. Notably, CoT-trained models resolve intermediate results at shallower layers than their non-CoT counterparts, freeing deeper layers to specialize in subsequent reasoning steps. A key insight is that CoT training teaches models how to think, by fostering compositional reasoning, rather than merely what to think through the provision of correct answers alone. This paper offers insights for designing CoT strategies that improve the reasoning robustness of LLMs.
