Large language models (LLMs) increasingly combine long-context processing with advanced reasoning, enabling them to retrieve and synthesize information distributed across tens of thousands of tokens. A natural hypothesis is that stronger reasoning capability should improve safety by helping models recognize harmful intent even when it is not stated explicitly. We test this hypothesis in long-context settings where harmful intent is implicit and must be inferred through reasoning, and find that it does not hold. We introduce compositional reasoning attacks, a new threat model in which a harmful query is decomposed into incomplete fragments scattered throughout a long context. The model is then prompted with a neutral reasoning query that induces retrieval and synthesis, causing the harmful intent to emerge only after composition. Evaluating 14 frontier LLMs on contexts up to 64k tokens, we uncover three findings: (1) models with stronger general reasoning capability are not more robust to compositional reasoning attacks, often assembling the intent yet failing to refuse; (2) safety alignment consistently degrades as context length increases; and (3) inference-time reasoning effort is a key mitigating factor: increasing inference-time compute reduces attack success by over 50 percentage points on the GPT-oss-120b model. Together, these results suggest that safety does not automatically scale with reasoning capability, especially under long-context inference.
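To make the threat model concrete, the following minimal Python sketch illustrates one way such a compositional prompt could be assembled: a query is split into fragments, the fragments are scattered among benign filler passages, and a neutral synthesis query is appended at the end. The function name, the fragment tagging, and the packing scheme are all illustrative assumptions for exposition, not the paper's released implementation.

```python
import random

def build_compositional_prompt(fragments, filler_docs, neutral_query, seed=0):
    """Scatter `fragments` of a decomposed query among benign `filler_docs`,
    then append a neutral reasoning query that induces the model to retrieve
    and recombine them. Hypothetical sketch; names and format are assumptions.
    Requires len(fragments) <= len(filler_docs) + 1."""
    rng = random.Random(seed)
    docs = list(filler_docs)
    # Pick distinct insertion points among the filler passages.
    positions = sorted(rng.sample(range(len(docs) + 1), len(fragments)))
    # Insert fragments in order, shifting indices to account for prior inserts.
    for offset, (pos, frag) in enumerate(zip(positions, fragments)):
        docs.insert(pos + offset, f"[Note {offset + 1}] {frag}")
    context = "\n\n".join(docs)
    return f"{context}\n\nQuestion: {neutral_query}"

# Usage sketch with placeholder content:
prompt = build_compositional_prompt(
    fragments=["fragment A ...", "fragment B ..."],
    filler_docs=["benign passage 1", "benign passage 2", "benign passage 3"],
    neutral_query="Combine the numbered notes above into one coherent procedure.",
)
```

The key design point the sketch captures is that each fragment is individually innocuous, and the final query is itself neutral: the harmful intent only materializes once the model performs the retrieval and composition step across the long context.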