The advancement of large language models (LLMs) has significantly accelerated the development of search agents capable of autonomously gathering information through multi-turn web interactions. Various benchmarks have been proposed to evaluate such agents. However, existing benchmarks often construct queries backward from answers, producing unnatural tasks misaligned with real-world needs. Moreover, these benchmarks tend to focus on either locating specific information or aggregating information from multiple sources, while relying on static answer sets prone to data contamination. To bridge these gaps, we introduce GISA, a benchmark for General Information-Seeking Assistants comprising 373 human-crafted queries that reflect authentic information-seeking scenarios. GISA features four structured answer formats (item, set, list, and table), enabling deterministic evaluation. It integrates both deep reasoning and broad information aggregation within unified tasks, and includes a live subset with periodically updated answers to resist memorization. Notably, GISA provides complete human search trajectories for every query, offering gold-standard references for process-level supervision and imitation learning. Experiments on mainstream LLMs and commercial search products reveal that even the best-performing model achieves an exact match score of only 19.30\%, with performance degrading notably on tasks requiring complex planning and comprehensive information gathering. These findings highlight substantial room for future improvement.