Large language models (LLMs) that exhibit test-time scaling behavior, such as extended reasoning traces and self-verification, have demonstrated remarkable performance on complex, long-term reasoning tasks. However, the robustness of these reasoning behaviors remains underexplored. To investigate this, we conduct a systematic evaluation of multiple reasoning models across three scenarios: (1) problems augmented with lengthy, irrelevant context; (2) multi-turn conversational settings with independent tasks; and (3) problems presented as a subtask within a complex task. We observe an interesting phenomenon: under these context conditions, reasoning models tend to produce much shorter reasoning traces (up to 50% shorter) for a given problem than they do when the same problem is presented in isolation. A finer-grained analysis reveals that this compression is associated with a decrease in self-verification and uncertainty management behaviors, such as double-checking. While this behavioral shift does not compromise performance on straightforward problems, it may affect performance on more challenging tasks. We hope our findings draw additional attention to both the robustness of reasoning models and the problem of context management for LLMs and LLM-based agents.