Current LLM post-training methods optimize complete reasoning trajectories through Supervised Fine-Tuning (SFT) followed by outcome-based Reinforcement Learning (RL). While effective, this approach does not align with how humans actually solve problems. Human cognition naturally decomposes problem-solving into two distinct stages: first acquiring abstract strategies (i.e., meta-knowledge) that generalize across problems, then adapting them to specific instances. By treating complete trajectories as basic units, current methods are instead inherently problem-centric, entangling abstract strategies with problem-specific execution. To address this misalignment, we propose a cognitively inspired framework that explicitly mirrors this two-stage cognitive process. Chain-of-Meta-Thought (CoMT) focuses supervised learning on abstract reasoning patterns without problem-specific executions, enabling the acquisition of generalizable strategies. Confidence-Calibrated Reinforcement Learning (CCRL) then optimizes task adaptation via confidence-aware rewards on intermediate steps, preventing overconfident errors from cascading and improving execution reliability. Experiments across four models and eight benchmarks show improvements of 2.19\% in-distribution and 4.63\% out-of-distribution over standard methods, while reducing training time by 65--70\% and token consumption by 50\%, demonstrating that aligning post-training with human cognitive principles yields not only superior generalization but also greater training efficiency.
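To make the confidence-aware reward idea concrete, the sketch below shows one plausible way to combine an outcome reward with a per-step calibration term, so that overconfident wrong steps are penalized most heavily. This is a minimal illustration, not the paper's actual CCRL formulation: the function name, the Brier-style calibration term, and the `alpha`/`beta` weighting are all assumptions.

```python
def ccrl_reward(step_correct, step_confidence, outcome_correct,
                alpha=0.5, beta=0.5):
    """Hypothetical confidence-calibrated reward (illustrative only).

    Combines a binary outcome reward with a per-step calibration score:
    each step is penalized by the squared gap between its stated
    confidence and its actual correctness, so a confidently wrong step
    (e.g. confidence 0.9, incorrect) costs far more than an uncertain
    wrong step, discouraging overconfident errors from cascading.
    """
    outcome_r = 1.0 if outcome_correct else 0.0
    # Brier-style penalty per intermediate step: -(confidence - correctness)^2
    calib = [-(c - (1.0 if ok else 0.0)) ** 2
             for ok, c in zip(step_correct, step_confidence)]
    step_r = sum(calib) / len(calib) if calib else 0.0
    return alpha * outcome_r + beta * step_r
```

Under this toy scoring, a trajectory with a confidently wrong intermediate step receives a strictly lower reward than one whose wrong step was flagged with low confidence, which is the calibration behavior the abstract describes.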