Large Reasoning Models (LRMs) excel at complex reasoning but are traditionally evaluated in static, "frozen world" settings: model responses are assumed to be instantaneous, and the context of a request is presumed to be immutable over the duration of the response. While generally true for short-term tasks, the "frozen world" assumption breaks down in modern reasoning tasks such as assistive programming, where models may take hours to think through problems and the code may change dramatically between the time the model starts thinking and its final output. In this work, we challenge the frozen world assumption and evaluate LRM robustness under two realistic dynamic scenarios: interruptions, which test the quality of the model's partial outputs on a limited budget, and dynamic context, which tests model adaptation to in-flight changes. Across mathematics and programming benchmarks that require long-form reasoning, static evaluations consistently overestimate robustness: even state-of-the-art LRMs, which achieve high accuracy in static settings, can fail unpredictably when interrupted or exposed to changing context, with performance dropping by up to 60% when updates are introduced late in the reasoning process. Our analysis further reveals several novel failure modes, including reasoning leakage, where interrupted models fold their reasoning into the final answer; panic, where models under time pressure abandon reasoning entirely and return incorrect answers; and self-doubt, where performance degrades as models incorporate updated information.
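To make the two dynamic scenarios concrete, the sketch below shows one way an interruption or an in-flight context update could be simulated against a generic LRM inference call. This is only an illustrative outline, not the paper's evaluation harness: the `generate` function, the prompt wording, and the token budgets are all placeholder assumptions standing in for whatever backend and protocol an actual study would use.

```python
# Illustrative sketch of interruption- and dynamic-context-style probes.
# `generate` is a hypothetical stand-in for any LRM inference call that
# returns the model's text, truncated at `max_tokens`.

def generate(prompt: str, max_tokens: int) -> str:
    """Hypothetical LRM call; replace with a real inference backend."""
    raise NotImplementedError

def interrupted_answer(question: str, budget_tokens: int) -> str:
    # 1. Let the model reason only until the token budget is exhausted.
    partial_reasoning = generate(
        f"Think step by step about the following problem:\n{question}",
        max_tokens=budget_tokens,
    )
    # 2. Interrupt: force a final answer conditioned on the partial trace.
    return generate(
        f"{question}\n\nPartial reasoning so far:\n{partial_reasoning}\n"
        "Time is up. Give your final answer now.",
        max_tokens=64,
    )

def dynamic_context_answer(question: str, update: str,
                           inject_after_tokens: int) -> str:
    # Same idea, but a context update is injected mid-reasoning and the
    # model is asked to continue with the changed problem statement.
    partial_reasoning = generate(
        f"Think step by step about the following problem:\n{question}",
        max_tokens=inject_after_tokens,
    )
    return generate(
        f"{question}\n\nReasoning so far:\n{partial_reasoning}\n"
        f"Update to the problem context: {update}\n"
        "Continue reasoning and give your final answer.",
        max_tokens=1024,
    )
```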