LocalSearchBench: Benchmarking Agentic Search in Real-World Local Life Services

Hang He, Chuhuai Yue, Chengqi Dong, Mingxue Tian, Zhenfeng Liu, Jiajun Chai, Xiaohan Wang, Yufei Zhang, Qun Liao, Guojun Yin, Wei Lin, Chengcheng Wan, Haiying Sun, Ting Su

Recent advances in large reasoning models (LRMs) have enabled agentic search systems to perform complex multi-step reasoning across multiple sources. However, most studies focus on general information retrieval and rarely explore vertical domains with unique challenges. In this work, we focus on local life services and introduce LocalSearchBench, which encompasses diverse and complex business scenarios. Real-world queries in this domain are often ambiguous and require multi-hop reasoning across merchants and products, a challenge that has not been fully addressed. As the first comprehensive benchmark for agentic search in local life services, LocalSearchBench includes over 150,000 high-quality entries from various cities and business types. We construct 300 multi-hop QA tasks based on real user queries, challenging agents to understand questions and retrieve information in multiple steps. We also develop LocalPlayground, a unified environment that integrates multiple tools for agent interaction. Experiments show that even state-of-the-art LRMs struggle on LocalSearchBench: the best model (DeepSeek-V3.1) achieves only 34.34% correctness, and most models have issues with completeness (average 77.33%) and faithfulness (average 61.99%). These results highlight the need for specialized benchmarks and domain-specific agent training in local life services. Code, benchmark, and leaderboard are available at localsearchbench.github.io.
