With the emergence of search-enabled generative QA systems, users are increasingly turning to tools that browse, aggregate, and reconcile evidence across multiple sources on their behalf. Yet many widely used QA benchmarks remain answerable by retrieving a single relevant passage, making them poorly suited to measuring cross-source sensemaking, such as integrating evidence, tracking causal links, and resolving dependencies across facets of a topic. We present iAgentBench, a dynamic open-domain QA (ODQA) benchmark that targets these higher-level information needs while keeping questions natural and grounded in realistic information-seeking behavior. iAgentBench draws seed topics from real-world attention signals and uses common user intent patterns to construct user-like questions whose answers require combining evidence from multiple sources, not just extracting a single snippet. Each instance is released with traceable evidence and auditable intermediate artifacts that support contamination checks and enable fine-grained diagnosis of failures in retrieval versus synthesis. Experiments across multiple LLMs show that retrieval improves accuracy but by itself does not reliably resolve these questions, underscoring the need to evaluate evidence use, not just evidence access.