Aggregation querying over free text is a long-standing yet underexplored problem. Unlike ordinary question answering, aggregate queries require exhaustive evidence collection: systems must "find all," not merely "find one." Existing paradigms such as Text-to-SQL and Retrieval-Augmented Generation (RAG) fail to achieve this completeness. In this work, we formalize entity-level aggregation querying over text in a corpus-bounded setting with a strict completeness requirement. To enable principled evaluation, we introduce AGGBench, a benchmark designed to evaluate completeness-oriented aggregation over a realistic, large-scale corpus. To accompany the benchmark, we propose DFA (Disambiguation--Filtering--Aggregation), a modular agentic baseline that decomposes aggregation querying into interpretable stages and exposes key failure modes related to ambiguity, filtering, and aggregation. Empirical results show that DFA consistently improves aggregation evidence coverage over strong RAG and agentic baselines. The data and code are available at \url{https://anonymous.4open.science/r/DFA-A4C1}.