Recent advances in reasoning models and agentic AI systems have led to an increased reliance on diverse external information. However, this shift introduces input contexts that are inherently noisy, a reality that current sanitized benchmarks fail to capture. We introduce NoisyBench, a comprehensive benchmark that systematically evaluates model robustness across 11 datasets spanning retrieval-augmented generation (RAG), reasoning, alignment, and tool-use tasks against diverse noise types, including random documents, irrelevant chat histories, and hard-negative distractors. Our evaluation reveals catastrophic performance drops of up to 80% in state-of-the-art models faced with contextual distractors. Crucially, we find that agentic workflows often amplify these errors by over-trusting noisy tool outputs, and that distractors can trigger emergent misalignment even without adversarial intent. Prompting, context engineering, supervised fine-tuning (SFT), and outcome-reward-only RL all fail to ensure robustness; in contrast, our proposed Rationale-Aware Reward (RARE) significantly strengthens resilience by incentivizing the model to identify helpful information within noise. Finally, we uncover an inverse-scaling trend in which increased test-time computation leads to worse performance in noisy settings, and we show via attention visualization that models disproportionately attend to distractor tokens, providing vital insights for building the next generation of robust, reasoning-capable agents.
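To make the noise-injection setup concrete, the following is a minimal illustrative sketch of how a noisy context of the kind the benchmark describes (gold evidence mixed with random documents and hard negatives) might be constructed. It is not the benchmark's actual code; the function and parameter names (`build_noisy_context`, `n_noise`) are hypothetical.

```python
import random

def build_noisy_context(gold_docs, random_docs, hard_negatives, n_noise=8, seed=0):
    """Mix gold evidence with distractor passages (illustrative only).

    `gold_docs` holds the evidence needed to answer, `random_docs` are
    off-topic passages, and `hard_negatives` are topically similar but
    unhelpful passages. All names here are hypothetical.
    """
    rng = random.Random(seed)
    pool = random_docs + hard_negatives
    noise = rng.sample(pool, min(n_noise, len(pool)))
    context = list(gold_docs) + noise
    rng.shuffle(context)  # interleave so position carries no signal about relevance
    return context
```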
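The abstract describes RARE only at a high level. One plausible reading is an outcome reward shaped by how well the model's rationale grounds itself in gold evidence rather than distractors; the sketch below illustrates that incentive structure under stated assumptions and is not the paper's formulation. The shaping weight `alpha` and the `cited_doc_ids` / `gold_doc_ids` interface are hypothetical.

```python
def rationale_aware_reward(answer, gold_answer, cited_doc_ids, gold_doc_ids, alpha=0.5):
    """Hypothetical RARE-style shaping: outcome correctness plus a bonus for
    citing gold evidence instead of distractors. The paper's exact reward
    may differ; this only illustrates the described incentive."""
    outcome = 1.0 if answer.strip().lower() == gold_answer.strip().lower() else 0.0
    cited, gold = set(cited_doc_ids), set(gold_doc_ids)
    grounding = len(cited & gold) / len(cited) if cited else 0.0  # citation precision
    return outcome + alpha * grounding  # higher when the rationale ignores noise
```

Under such a shaping, a correct answer whose rationale leans on distractors earns less than one grounded in the gold evidence, matching the stated goal of rewarding the identification of helpful information within noise.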