Recent search agents combine multi-turn reasoning with search tools to achieve strong performance on multi-hop and long-horizon benchmarks. Yet it remains unclear whether they reliably track, verify, and maintain every condition that such questions impose. We study this capability on multi-constraint problems, where a valid answer must satisfy several constraints simultaneously. We find that illusory completion occurs frequently: agents believe a task is complete despite unresolved or violated constraints, which leads to underverified answers. To diagnose this behavior, we introduce the Epistemic Ledger, an evaluation framework that tracks the evidential support and the agent's belief for each constraint throughout multi-turn reasoning. Our analysis reveals four recurring failure patterns: bare assertions, overlooked refutations, stagnation, and premature exit. Motivated by these findings, we use LiveLedger, an inference-time tracker, to examine whether explicit constraint-state tracking during execution mitigates these failures. This simple intervention consistently improves performance, substantially reducing underverified answers (by up to 26.5%) and improving overall accuracy (by up to 11.6%) on multi-constraint problems.
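To make per-constraint tracking concrete, the following is a minimal sketch of what a ledger of this kind might record. It is not the authors' implementation. Every name in it (`Evidence`, `ConstraintEntry`, `Ledger`, and the methods) is hypothetical, and the three-valued evidence state and the reading of "illusory completion" as "believed complete while some constraint lacks support" are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Evidence(Enum):
    """Assumed three-valued evidence state for one constraint."""
    UNRESOLVED = "unresolved"  # nothing retrieved bears on the constraint yet
    SUPPORTED = "supported"    # retrieved evidence backs the constraint
    REFUTED = "refuted"        # retrieved evidence contradicts the constraint


@dataclass
class ConstraintEntry:
    text: str
    evidence: Evidence = Evidence.UNRESOLVED
    believed_satisfied: bool = False  # the agent's claim, kept separate from evidence


@dataclass
class Ledger:
    """Hypothetical ledger: one entry per constraint in the question."""
    entries: dict[str, ConstraintEntry] = field(default_factory=dict)

    def add(self, key: str, text: str) -> None:
        self.entries[key] = ConstraintEntry(text)

    def record_evidence(self, key: str, evidence: Evidence) -> None:
        self.entries[key].evidence = evidence

    def record_belief(self, key: str, believed: bool) -> None:
        self.entries[key].believed_satisfied = believed

    def underverified(self) -> list[str]:
        """Keys of constraints that are unresolved or refuted."""
        return [k for k, e in self.entries.items()
                if e.evidence is not Evidence.SUPPORTED]

    def illusory_completion(self) -> bool:
        """True if the agent believes every constraint holds while
        at least one constraint still lacks supporting evidence."""
        if not self.entries:
            return False
        all_believed = all(e.believed_satisfied for e in self.entries.values())
        return all_believed and bool(self.underverified())


# Usage: two constraints, only one of which has been verified.
ledger = Ledger()
ledger.add("c1", "born before 1950")
ledger.add("c2", "won a national award")
ledger.record_evidence("c1", Evidence.SUPPORTED)
ledger.record_belief("c1", True)
ledger.record_belief("c2", True)      # belief asserted without any evidence
print(ledger.underverified())         # ['c2']
print(ledger.illusory_completion())   # True
```

Keeping belief and evidence as separate fields is the design choice that matters here: a mismatch between the two becomes directly observable at each turn, and an inference-time tracker could surface that mismatch to the agent before it commits to an answer.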