Reasoning-capable large language models can be induced to spend their generation budget on injected decoy tasks rather than answering the user's question, causing denial of service when no final answer is produced and denial of wallet when excess output tokens are billed. Input-side safety classifiers often miss these attacks because the injected prompts can appear syntactically benign. We build RecurGuard, a runtime monitor for detecting reasoning-chain consumption attacks when reasoning traces are exposed by the model. RecurGuard analyzes reasoning traces as they are generated and tracks three signals: recurrence rate, volume growth, and progress toward the user's query. If all three signals remain anomalous over three consecutive chunks, RecurGuard terminates generation early. We evaluate RecurGuard against OverThink and ExtendAttack across open-weight reasoning models and conduct adaptive stress tests on DS-R1-Qwen-7B. On this model, RecurGuard detects 99% of OverThink attacks and 92% of ExtendAttack instances while maintaining near-zero false positive rates on question answering, code generation, mathematics, and summarization. Adaptive evaluation reveals the limit of the defense: topical attacks retain 11.9x amplification with an approximately 50% joint miss rate, whereas full semantic evasion reduces amplification from 22.8x to 2.2x. When reasoning traces are unavailable, QDM provides a post-hoc fallback monitor based on the final output.
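The termination rule described above (all three signals anomalous over three consecutive chunks) can be illustrated with a minimal sketch. The abstract does not define how the signals are computed, so everything below is an assumption made for illustration: the class name `ChunkMonitor`, the trigram-overlap proxy for recurrence, the cumulative token budget as a proxy for volume growth, the lexical-overlap proxy for progress toward the query, and all threshold values are hypothetical and are not RecurGuard's actual implementation. Only the three-signal conjunction and the three-chunk patience come from the text.

```python
from dataclasses import dataclass, field


def _ngrams(tokens, n=3):
    """Return the set of word n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


@dataclass
class ChunkMonitor:
    """Hypothetical chunk-level monitor; signal definitions are assumptions."""
    query: str
    recurrence_threshold: float = 0.5  # assumed: share of trigrams already seen
    volume_budget: int = 200           # assumed: cumulative token count deemed excessive
    progress_threshold: float = 0.1    # assumed: share of query terms in the chunk
    patience: int = 3                  # from the abstract: three consecutive chunks
    _seen: set = field(default_factory=set, init=False)
    _total_tokens: int = field(default=0, init=False)
    _streak: int = field(default=0, init=False)

    def observe(self, chunk: str) -> bool:
        """Consume one reasoning chunk; return True if generation should stop."""
        tokens = chunk.lower().split()
        grams = _ngrams(tokens)

        # Signal 1 (assumed proxy): fraction of this chunk's trigrams seen before.
        recurrence = len(grams & self._seen) / len(grams) if grams else 0.0
        self._seen |= grams

        # Signal 2 (assumed proxy): cumulative reasoning volume so far.
        self._total_tokens += len(tokens)

        # Signal 3 (assumed proxy): lexical overlap between chunk and user query.
        query_terms = set(self.query.lower().split())
        progress = (
            len(query_terms & set(tokens)) / len(query_terms) if query_terms else 0.0
        )

        anomalous = (
            recurrence > self.recurrence_threshold
            and self._total_tokens > self.volume_budget
            and progress < self.progress_threshold
        )
        # The streak resets as soon as any one signal looks normal.
        self._streak = self._streak + 1 if anomalous else 0
        return self._streak >= self.patience
```

In this sketch a caller would feed each newly generated reasoning chunk to `observe` and abort generation on the first `True`. Requiring all three signals to agree, and to persist, is what would keep a long but on-topic reasoning trace from being terminated; it is also why an attack that stays topical can evade a conjunction of this kind, consistent with the adaptive results reported above.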