Real-world multi-hop QA is inherently intertwined with ambiguity: a single query can trigger multiple reasoning paths, each requiring independent resolution. Because ambiguity can arise at any hop, models must navigate layered uncertainty throughout the entire reasoning chain. Despite its prevalence in real-world user queries, prior benchmarks have focused primarily on single-hop ambiguity, leaving the complex interplay between multi-step inference and layered ambiguity underexplored. In this paper, we introduce \textbf{MARCH}, a benchmark targeting this intersection, comprising 2,209 multi-hop ambiguous questions curated via multi-LLM verification and validated by human annotation with strong inter-annotator agreement. Our experiments reveal that even state-of-the-art models struggle on MARCH, confirming that combining ambiguity resolution with multi-step reasoning poses a significant challenge. To address this, we propose \textbf{CLARION}, a two-stage agentic framework that explicitly decouples ambiguity planning from evidence-driven reasoning; it significantly outperforms existing approaches and paves the way toward robust reasoning systems.