Recent advances in post-training techniques have enabled Large Language Models (LLMs) to tackle complex, logic-intensive tasks by generating supplementary planning tokens. This development raises a fundamental question: are these models aware of what they "learn" and "think"? To address it, we define three core competencies: (1) awareness of learned latent policies, (2) generalization of these policies across domains, and (3) alignment between internal reasoning traces and final outputs. We empirically evaluate these competencies on several tasks, each designed to require learning a distinct policy. We also contrast the profiles of models post-trained via Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO). Our findings indicate that, compared with SFT models, RL-trained models demonstrate greater awareness of their learned behaviors and generalize better to novel, structurally similar tasks. However, they often exhibit weak alignment between their reasoning traces and final outputs, an effect most pronounced in GRPO-trained models.