When Plausible Is Not Realistic: Evaluating Human Mobility in LLM-Based Urban Simulation

LLM-based generative agents are increasingly used in urban simulators, yet it remains unclear whether they reproduce empirically realistic human mobility patterns or merely generate plausible mobility narratives. We introduce a validation framework that evaluates the mobility of generative agents in LLM-based urban simulators against real-world mobility data. The framework compares simulated and observed behavior along five dimensions: mobility laws, temporal rhythms, network motifs, semantic activity transitions, and behavioral mobility profiles. Using datasets from the Greater Paris region and Shanghai, we evaluate AgentSociety and CitySim on each of these dimensions. Our analysis reveals a substantial gap between narrative plausibility and empirical mobility realism. Although the simulators capture some high-level semantic activity distributions, they struggle to reproduce core spatial and temporal constraints, including realistic trip-length distributions, origin-destination flows, dwell times, and transition dynamics. We further observe that realistic mobility diversity is unstable across default prompting configurations and may require explicit profile-aware initialization. To support reproducible evaluation, we also contribute scalable, open, LLM-driven infrastructure for regional-scale map generation, observability-enhanced simulation, mobility-metric computation, and traffic simulation. Our findings highlight the need for rigorous empirical validation of LLM-based urban simulators and provide practical tools for building more realistic and reproducible urban simulation systems.
