WorldTravel: A Realistic Multimodal Travel-Planning Benchmark with Tightly Coupled Constraints

Zexuan Wang, Chenghao Yang, Yingqi Que, Zhenzhu Yang, Huaqing Yuan, Yiwen Wang, Zhengxuan Jiang, Shengjie Fang, Zhenhe Wu, Zhaohui Wang, Zhixin Yao, Jiashuo Liu, Jincheng Ren, Yuzhen Li, Yang Yang, Jiaheng Liu, Jian Yang, Zaiyuan Wang, Ge Zhang, Zhoufutu Wen, Wenhao Huang

Real-world autonomous planning requires coordinating tightly coupled constraints, where a single decision dictates the feasibility of all subsequent actions. However, existing benchmarks predominantly feature loosely coupled constraints that can be solved through local greedy decisions, and they rely on idealized data, so they fail to capture the complexity of extracting parameters from dynamic web environments. We introduce \textbf{WorldTravel}, a benchmark comprising 150 real-world travel scenarios across 5 cities, each requiring agents to satisfy more than 15 interdependent temporal and logical constraints on average. To evaluate agents in realistic deployments, we develop \textbf{WorldTravel-Webscape}, a multimodal environment featuring over 2,000 rendered webpages in which agents must perceive constraint parameters directly from visual layouts to inform their planning. Our evaluation of 10 frontier models reveals a significant performance collapse: even the state-of-the-art GPT-5.2 achieves only 32.67\% feasibility in the text-only setting, which drops to 19.33\% in the multimodal environment. We identify a critical Perception-Action Gap and a Planning Horizon threshold at approximately 10 constraints where model reasoning consistently fails, suggesting that perception and reasoning remain independent bottlenecks. These findings underscore the need for next-generation agents that unify high-fidelity visual perception with long-horizon reasoning to handle brittle real-world logistics.
