Reasoning-trained language models often spend more tokens on harder problems, but longer chains of thought do not show whether a model is merely computing for more steps or following a different internal trajectory. We study this distinction through hidden-state trajectories during chain-of-thought generation across competitive programming, mathematics, and Boolean satisfiability. Raw trajectory geometry is strongly shaped by generation length: longer generations mechanically alter path statistics, so difficulty-dependent comparisons are misleading without adjustment. After residualizing trajectory statistics on length, difficulty remains systematically coupled to corrected trajectory geometry across all domains studied. The clearest reasoning-specific separation appears in the code domain, where harder problems show more direct corrected trajectories and less heterogeneous local curvature in reasoning-trained models than in matched instruction-tuned baselines. Corrected difficulty-geometry coupling is weaker, but still present, in mathematics and Boolean satisfiability. Prompt-stage linear probes do not mirror the code-domain separation, and behavioral annotations show that stronger corrected coupling co-occurs with strategy shifts and uncertainty monitoring. Together, these findings establish length correction as a prerequisite for generation-time trajectory analysis and show that reasoning training can be associated with distinct corrected trajectory geometry, with the strength of the effect depending on the domain.
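As an illustration of the length correction described above, the following is a minimal sketch, not the procedure used in this work. It assumes each generation yields a hidden-state trajectory of shape (T, d), that "directness" is measured as net displacement divided by total path length, and that length is removed by an ordinary least-squares fit on log generation length. The function names `directness` and `residualize`, the choice of log length as the regressor, and the definition of directness are all assumptions made for illustration.

```python
import numpy as np


def directness(traj):
    """Net displacement divided by total path length for a (T, d) trajectory.

    Assumed definition for illustration: 1.0 means a straight path,
    values near 0 mean a highly winding path.
    """
    steps = np.diff(traj, axis=0)                    # (T-1, d) step vectors
    path_len = np.linalg.norm(steps, axis=1).sum()   # total distance travelled
    net = np.linalg.norm(traj[-1] - traj[0])         # start-to-end distance
    return net / path_len if path_len > 0 else 0.0


def residualize(stat, length):
    """Residuals of an OLS fit of `stat` on an intercept and log(length).

    The residuals are, by construction, uncorrelated with log(length),
    so any remaining association with difficulty is not explained by a
    log-linear length effect. The log-linear form is an assumption.
    """
    stat = np.asarray(stat, dtype=float)
    log_len = np.log(np.asarray(length, dtype=float))
    X = np.column_stack([np.ones_like(log_len), log_len])
    coef, *_ = np.linalg.lstsq(X, stat, rcond=None)
    return stat - X @ coef
```

In use, one would compute `directness` for each generation, pass the resulting values and the generation lengths to `residualize`, and then compare the residuals across difficulty levels. A nonlinear length effect would call for a more flexible regressor than the single log term assumed here.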