Large language models (LLMs) increasingly solve difficult problems by producing "reasoning traces" before emitting a final response. However, it remains unclear how accuracy and decision commitment evolve along a reasoning trajectory, and whether intermediate trace segments provide answer-relevant information beyond generic length or stylistic effects. Here, we propose a protocol to systematically probe the trajectories of reasoning traces in LLMs by 1) generating a model's reasoning trace, 2) truncating it at fixed token-percentiles, and 3) injecting each partial trace back into the model (or a different model) to measure the induced distribution over answer choices via next-token probabilities. We apply this protocol to the open-source Qwen3-4B/-8B/-14B and gpt-oss-20b/-120b models across the multiple-choice GPQA Diamond and MMLU-Pro benchmarks. We find that accuracy and decision commitment consistently increase as the percentage of provided reasoning tokens grows. These gains are driven primarily by relevant content in the model's generation rather than by context length or generic "reasoning style" effects. Stronger models often backtrack successfully from a weaker model's incorrect partial traces, whereas their immediate answers remain anchored to the weaker model's incorrect response. More broadly, we show that trajectory probing provides diagnostics for efficient and safer deployment of reasoning models: the measurements can inform practical trace-handling and monitoring policies that improve reliability without assuming intermediate tokens are inherently faithful explanations.
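The truncate-and-reinject steps of the protocol can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names and the toy logits are assumptions, and a real run would query the model for the next-token logits of each answer-choice token (e.g. "A"/"B"/"C"/"D") after each partial trace.

```python
import math

def truncate_at_percentiles(trace_tokens, percentiles=(0, 25, 50, 75, 100)):
    """Step 2: cut a reasoning trace at fixed token-percentiles.

    Returns a dict mapping each percentile to the trace prefix that
    would be injected back into the model.
    """
    n = len(trace_tokens)
    return {p: trace_tokens[: (n * p) // 100] for p in percentiles}

def answer_distribution(choice_logits):
    """Step 3 (readout): softmax over the next-token logits of the
    answer-choice tokens to obtain a distribution over answers."""
    m = max(choice_logits.values())          # subtract max for stability
    exps = {c: math.exp(v - m) for c, v in choice_logits.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

# Example: a 10-token trace truncated at 50% keeps the first 5 tokens;
# the logits here are placeholders for a model's actual output.
trace = list(range(10))
partials = truncate_at_percentiles(trace)
probs = answer_distribution({"A": 2.0, "B": 0.5, "C": 0.0, "D": -1.0})
```

Sweeping the percentile and recording the resulting answer distribution yields the accuracy and decision-commitment curves described above.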