Recent advances in LLMs, particularly in language reasoning and tool integration, have rapidly spurred the real-world development of language agents. Among their applications, travel planning is a prominent domain that combines academic challenge with practical value, owing to its complexity and market demand. However, existing benchmarks fail to reflect the diverse, real-world requirements that are crucial for deployment. To address this gap, we introduce ChinaTravel, a benchmark specifically designed for authentic Chinese travel planning scenarios. We collect travel requirements from questionnaires and propose a compositionally generalizable domain-specific language that enables scalable evaluation covering feasibility, constraint satisfaction, and preference comparison. Empirical studies reveal the potential of neuro-symbolic agents in travel planning: they achieve a constraint satisfaction rate of 27.9%, significantly surpassing the 2.6% of purely neural models. Moreover, we identify key challenges in real-world travel planning deployments, including open language reasoning and unseen concept composition. These findings highlight the significance of ChinaTravel as a pivotal milestone for advancing language agents in complex, real-world planning scenarios.