We investigate whether language models internally track the value of their current trajectory, defined as the likelihood that their ongoing strategy will achieve their goals. Using synthetic, in-context reinforcement learning data, we construct a "value" axis for Qwen3-8B. We find that activations along this axis distinguish high from low verbalized confidence, rollouts without backtracking from rollouts with it, and correct from corrupted code. Steering towards high value causally suppresses self-correction and reduces explanatory verbosity, while steering towards low value induces backtracking and exploration. We demonstrate that direct preference optimization (DPO) can increase the internal value of rewarded behaviors (e.g., using a certain word), causing the model to act more confidently after exhibiting them. Finally, we apply the value axis to in-the-wild settings. For example, we find that Qwen assigns low value to politically sensitive chat queries after post-training, and that supervised fine-tuning increases internal confidence within the training domain. Our results suggest that language models linearly encode an estimate of expected goal success that modulates their confidence in pursuing a direction.
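To make the idea of a linear "value" axis concrete, the following is a minimal sketch, not the procedure used in this work. The abstract does not say how the axis is fit, so the sketch assumes a simple difference-of-means direction, and it uses synthetic Gaussian vectors in place of real Qwen3-8B activations. The function names `value_axis`, `project`, and `steer`, the dimension `d`, and the steering strength `alpha` are all hypothetical choices made for illustration.

```python
import numpy as np


def value_axis(high_acts, low_acts):
    """Assumed construction: unit vector from the low-value mean to the high-value mean."""
    diff = high_acts.mean(axis=0) - low_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def project(acts, axis):
    """Scalar readout of each activation vector along the axis."""
    return acts @ axis


def steer(acts, axis, alpha):
    """Shift every activation by alpha along the axis (positive = towards high value)."""
    return acts + alpha * axis


# Synthetic stand-ins for activations (NOT real model data): isotropic noise
# plus a planted direction separating "high value" from "low value" examples.
rng = np.random.default_rng(0)
d = 64  # hypothetical hidden size, far smaller than a real model's
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
high = rng.normal(size=(200, d)) + 2.0 * true_dir
low = rng.normal(size=(200, d)) - 2.0 * true_dir

axis = value_axis(high, low)
high_scores = project(high, axis)
low_scores = project(low, axis)

# Steering the low-value activations towards high value raises their readout.
steered_scores = project(steer(low, axis, alpha=4.0), axis)
```

Because `axis` has unit norm, adding `alpha * axis` raises the projection by exactly `alpha`. That is why a single scalar can act as a steering strength in this toy setting; in a real model the downstream effect of such a shift is an empirical question.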