Recent large language models (LLMs) have shown rapid progress in reading-based question answering (QA), where evidence is explicitly provided or can be trivially retrieved. In contrast, real-world questions are often not paired with accurate evidence documents: the useful evidence resides in massive data lakes, making search a prerequisite for answering. However, comprehensive benchmarks that require both searching and reasoning over large data lakes are lacking. To this end, we introduce LakeQA, a comprehensive benchmark for search-centric question answering over data lakes that jointly emphasizes searching and reasoning capabilities. LakeQA is built on a heterogeneous collection of approximately 9.5 TB of text resources from Wikipedia and open-source government data, spanning structured and unstructured data. To ensure task quality, each sample is annotated by at least one Ph.D.-level expert. Each task requires long-horizon, multi-hop reasoning with implicit intermediate steps: agents must first discover the correct documents and then compose evidence across sources to produce the answer. Experiments on seven frontier LLMs show that LakeQA is challenging: for instance, GPT-5.2 achieves an exact-match score of only 18.37%. Overall, LakeQA provides a realistic testbed for developing LLM agents that can both find and analyze data in modern data lakes.