Search-agent benchmarks such as BrowseComp have saturated rapidly over the past year, with the strongest models now exceeding 90% accuracy. Because these benchmarks are predominantly human-authored, annotators lack a global view of entity statistics and cannot systematically maximize search-space size or structural complexity, which imposes a ceiling on question difficulty. To address this, we introduce LoHoSearch (Long-Horizon Search Agents), a challenging benchmark of 544 human-verified questions spanning 11 domains. LoHoSearch is constructed by an automated pipeline built on a knowledge graph (KG) covering over 7 million Wikipedia entities; the pipeline selects relations with large search spaces and assembles them into structurally complex questions whose answers are verified against the KG to be unique. In our evaluation, even the strongest model achieves only 34.74% accuracy, and existing context management strategies (best: +6.8%) yield far smaller gains than they do on prior benchmarks. LoHoSearch thus provides a more demanding standard for evaluating long-horizon reasoning and context management in search agents.
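To make the idea of "large search spaces" and "KG-verified unique answers" concrete, the following is a minimal sketch on a toy triple store. It is not the LoHoSearch pipeline: the triples, the constraint representation as `(relation, object)` pairs, and the function names `build_index`, `search_space_size`, `candidates`, and `has_unique_answer` are all assumptions introduced here for illustration.

```python
from collections import defaultdict

# Toy triples (subject, relation, object). Purely illustrative;
# these are NOT taken from the LoHoSearch knowledge graph.
TRIPLES = [
    ("A", "born_in", "X"),
    ("B", "born_in", "X"),
    ("C", "born_in", "X"),
    ("A", "occupation", "painter"),
    ("B", "occupation", "painter"),
    ("A", "award", "P"),
    ("C", "award", "P"),
]


def build_index(triples):
    """Map each (relation, object) constraint to the set of subjects satisfying it."""
    index = defaultdict(set)
    for subject, relation, obj in triples:
        index[(relation, obj)].add(subject)
    return index


def search_space_size(index, constraint):
    """Number of entities that satisfy a single constraint (hypothetical measure)."""
    return len(index.get(constraint, set()))


def candidates(index, constraints):
    """Entities that satisfy every constraint, i.e. the intersection of their sets."""
    sets = [index.get(c, set()) for c in constraints]
    return set.intersection(*sets) if sets else set()


def has_unique_answer(index, constraints):
    """A question is accepted only if exactly one entity satisfies all constraints."""
    return len(candidates(index, constraints)) == 1
```

In this toy setting, each individual constraint matches several entities, so no single clue identifies the answer, while the conjunction of all three constraints narrows the candidates to exactly one entity. A real pipeline would presumably also need to handle multi-hop relations and incomplete KG coverage, which this sketch ignores.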