Travel planning is a natural real-world task for testing the planning and tool-use abilities of large language models (LLMs). Although prior work has studied LLM performance on travel planning, existing settings still fall short of real-world needs, mainly due to limited domain coverage, insufficient modeling of users' implicit preferences in multi-turn conversations, and the lack of a clear evaluation of agents' capability boundaries. To close these gaps, we propose \textbf{TravelBench}, a benchmark for fully real-world travel planning. We collect user queries, user profiles, and tools from real scenarios, and construct three subtasks (Single-Turn, Multi-Turn, and Unsolvable) to evaluate an agent's three core capabilities in real settings: (1) solving problems autonomously, (2) interacting with users over multiple turns to refine requirements, and (3) recognizing the limits of its own abilities. To enable stable tool invocation and reproducible evaluation, we cache real tool-call results and build a sandbox environment that integrates ten travel-related tools. Agents can combine these tools to solve most practical travel planning problems, and our systematic verification demonstrates the stability of the proposed benchmark. We further evaluate multiple LLMs on TravelBench and conduct an in-depth analysis of their behaviors and performance. TravelBench provides a practical and reproducible evaluation benchmark to advance research on LLM agents for travel planning.\footnote{Our code and data will be released after internal review.}