Recently, large language models (LLMs) with extensive general knowledge and powerful reasoning abilities have seen rapid development and widespread application. Systematic and reliable evaluation of LLMs and vision-language models (VLMs) is a crucial step toward applying and developing them in various fields. There have been some early explorations into the usability of LLMs for limited urban tasks, but a systematic and scalable evaluation benchmark is still lacking. The challenge in constructing such a benchmark for urban research lies in the diversity of urban data, the complexity of application scenarios, and the highly dynamic nature of urban environments. In this paper, we design CityBench, an interactive, simulator-based evaluation platform, as the first systematic benchmark for evaluating the capabilities of LLMs on diverse tasks in urban research. First, we build CityData to integrate diverse urban data and CitySimu to simulate fine-grained urban dynamics. On top of CityData and CitySimu, we design 8 representative urban tasks in 2 categories, perception-understanding and decision-making, to form CityBench. With extensive results from 30 well-known LLMs and VLMs across 13 cities around the world, we find that advanced LLMs and VLMs achieve competitive performance on urban tasks requiring commonsense and semantic-understanding abilities, e.g., understanding human dynamics and semantic inference over urban imagery. Meanwhile, they fail to solve challenging urban tasks requiring professional knowledge and high-level reasoning abilities, e.g., geospatial prediction and traffic control. These observations provide valuable perspectives for utilizing and developing LLMs in the future. Code is openly accessible at https://github.com/tsinghua-fib-lab/CityBench.