Large Reasoning Models (LRMs) excel at complex reasoning but are traditionally evaluated in static, "frozen world" settings: model responses are assumed to be instantaneous, and the context of a request is presumed to be immutable over the duration of the response. While generally true for short-term tasks, the "frozen world" assumption breaks down in modern reasoning tasks such as assistive programming, where models may take hours to think through problems and the code may change dramatically between the time the model starts thinking and its final output. In this work, we challenge the frozen world assumption and evaluate LRM robustness under two realistic dynamic scenarios: interruptions, which test the quality of the model's partial outputs on a limited budget, and dynamic context, which tests model adaptation to in-flight changes. Across mathematics and programming benchmarks that require long-form reasoning, static evaluations consistently overestimate robustness: even state-of-the-art LRMs, which achieve high accuracy in static settings, can fail unpredictably when interrupted or exposed to changing context, with performance dropping by up to 60% when updates are introduced late in the reasoning process. Our analysis further reveals several novel failure modes, including reasoning leakage, where interrupted models fold their reasoning into the final answer; panic, where models under time pressure abandon reasoning entirely and return incorrect answers; and self-doubt, where performance degrades as models incorporate updated information.
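To make the two dynamic scenarios concrete, the sketch below shows one way an interruption or an in-flight context update could be simulated against a generic LRM inference call. This is only an illustrative outline, not the paper's evaluation harness: the `generate` function, the prompt wording, and the token budgets are all placeholder assumptions standing in for whatever backend and protocol an actual study would use.

```python
# Illustrative sketch of interruption- and dynamic-context-style probes.
# `generate` is a hypothetical stand-in for any LRM inference call that
# returns the model's text, truncated at `max_tokens`.

def generate(prompt: str, max_tokens: int) -> str:
    """Hypothetical LRM call; replace with a real inference backend."""
    raise NotImplementedError

def interrupted_answer(question: str, budget_tokens: int) -> str:
    # 1. Let the model reason only until the token budget is exhausted.
    partial_reasoning = generate(
        f"Think step by step about the following problem:\n{question}",
        max_tokens=budget_tokens,
    )
    # 2. Interrupt: force a final answer conditioned on the partial trace.
    return generate(
        f"{question}\n\nPartial reasoning so far:\n{partial_reasoning}\n"
        "Time is up. Give your final answer now.",
        max_tokens=64,
    )

def dynamic_context_answer(question: str, update: str,
                           inject_after_tokens: int) -> str:
    # Same idea, but a context update is injected mid-reasoning and the
    # model is asked to continue with the changed problem statement.
    partial_reasoning = generate(
        f"Think step by step about the following problem:\n{question}",
        max_tokens=inject_after_tokens,
    )
    return generate(
        f"{question}\n\nReasoning so far:\n{partial_reasoning}\n"
        f"Update to the problem context: {update}\n"
        "Continue reasoning and give your final answer.",
        max_tokens=1024,
    )
```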