Agentic search has emerged as a promising paradigm for adaptive retrieval systems powered by large language models (LLMs). However, existing benchmarks primarily focus on quality, overlooking efficiency factors that are critical for real-world deployment. Moreover, real-world user queries often contain underspecified preferences, a challenge that remains largely underexplored in current agentic search evaluation. As a result, many agentic search systems remain impractical despite their impressive performance. In this work, we introduce HotelQuEST, a benchmark comprising 214 hotel search queries that range from simple factual requests to complex queries, enabling evaluation across the full spectrum of query difficulty. We further address the challenge of evaluating underspecified user preferences by collecting clarifications that make annotators' implicit preferences explicit for evaluation. We find that LLM-based agents achieve higher accuracy than traditional retrievers, but at substantially higher costs due to redundant tool calls and suboptimal routing that fails to match query complexity to model capability. Our analysis exposes inefficiencies in current agentic search systems and demonstrates substantial potential for cost-aware optimization.