Current LLM post-training methods optimize complete reasoning trajectories through Supervised Fine-Tuning (SFT) followed by outcome-based Reinforcement Learning (RL). While effective, this approach does not align with how humans actually solve problems. Human cognition naturally decomposes problem-solving into two distinct stages: first acquiring abstract strategies (i.e., meta-knowledge) that generalize across problems, then adapting them to specific instances. By treating complete trajectories as basic units, current methods are instead inherently problem-centric, entangling abstract strategies with problem-specific execution. To address this misalignment, we propose a cognitively inspired framework that explicitly mirrors this two-stage cognitive process. Chain-of-Meta-Thought (CoMT) focuses supervised learning on abstract reasoning patterns without problem-specific executions, enabling the acquisition of generalizable strategies. Confidence-Calibrated Reinforcement Learning (CCRL) then optimizes task adaptation via confidence-aware rewards on intermediate steps, preventing overconfident errors from cascading and improving execution reliability. Experiments across four models and eight benchmarks show improvements of 2.19\% in-distribution and 4.63\% out-of-distribution over standard methods, while reducing training time by 65--70\% and token consumption by 50\%, demonstrating that aligning post-training with human cognitive principles yields not only superior generalization but also greater training efficiency.
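To make the confidence-aware reward idea concrete, the sketch below shows one plausible way to combine an outcome reward with a per-step calibration term, so that overconfident wrong steps are penalized most heavily. This is a minimal illustration, not the paper's actual CCRL formulation: the function name, the Brier-style calibration term, and the `alpha`/`beta` weighting are all assumptions.

```python
def ccrl_reward(step_correct, step_confidence, outcome_correct,
                alpha=0.5, beta=0.5):
    """Hypothetical confidence-calibrated reward (illustrative only).

    Combines a binary outcome reward with a per-step calibration score:
    each step is penalized by the squared gap between its stated
    confidence and its actual correctness, so a confidently wrong step
    (e.g. confidence 0.9, incorrect) costs far more than an uncertain
    wrong step, discouraging overconfident errors from cascading.
    """
    outcome_r = 1.0 if outcome_correct else 0.0
    # Brier-style penalty per intermediate step: -(confidence - correctness)^2
    calib = [-(c - (1.0 if ok else 0.0)) ** 2
             for ok, c in zip(step_correct, step_confidence)]
    step_r = sum(calib) / len(calib) if calib else 0.0
    return alpha * outcome_r + beta * step_r
```

Under this toy scoring, a trajectory with a confidently wrong intermediate step receives a strictly lower reward than one whose wrong step was flagged with low confidence, which is the calibration behavior the abstract describes.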