UrbanWell: Benchmarking Multimodal Large Language Models for Spatio-Temporal Urban Wellbeing Analytics

Understanding urban wellbeing from multimodal data requires integrating heterogeneous spatial and temporal signals, which poses significant challenges for current multimodal large language models (MLLMs). We introduce UrbanWell, a large-scale benchmark designed to systematically evaluate the spatio-temporal reasoning capabilities of MLLMs for urban wellbeing analytics through joint modeling of satellite and street view imagery. UrbanWell spans 38 cities across multiple years and includes diverse indicators covering (1) environmental conditions (CO$_2$, NO$_2$, PM$_{2.5}$, and Normalized Difference Vegetation Index), (2) spatial accessibility (minimum distance to supermarkets and restaurants), (3) urban form (road length, road density, and land use), (4) urban vitality (population, economic activity diversity, and land use diversity), and (5) subjective perception attributes (e.g., safety, beauty, liveliness, wealth, and quietness). All indicators are aligned at the grid level to enable standardized evaluation. Beyond static prediction, UrbanWell defines temporal reasoning tasks, including future value forecasting from historical observations and temporal trend classification. We benchmark 15 representative state-of-the-art MLLMs in a zero-shot setting, providing a comprehensive comparative evaluation across spatial and temporal dimensions. Experimental results indicate that while MLLMs capture salient spatial and perceptual cues, their performance varies substantially across heterogeneous urban indicators, from environmental conditions to subjective perception. UrbanWell serves as a unified benchmark for evaluating multimodal spatial and temporal reasoning in urban wellbeing analytics, offering a standardized testbed for systematic assessment and future research on multimodal urban intelligence. Our code and datasets are available at https://github.com/axin1301/UrbanWell-Benchmark.
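As an illustration of the two temporal tasks named above (future value forecasting and temporal trend classification), the following is a minimal Python sketch of how grid-level predictions might be scored. It is not taken from the UrbanWell code base: the function names, the three trend labels, and the 5% relative-change threshold are all assumptions chosen for illustration, and the benchmark's actual label definitions and metrics may differ.

```python
from typing import Sequence


def trend_label(values: Sequence[float], rel_tol: float = 0.05) -> str:
    """Assign a coarse trend label to one grid cell's indicator series.

    Hypothetical rule (an assumption, not the benchmark's definition):
    compare the last observation with the first and call the series
    "stable" when the relative change stays within rel_tol.
    """
    if len(values) < 2:
        raise ValueError("need at least two observations")
    first, last = values[0], values[-1]
    # Guard against division by zero when the first value is 0.
    scale = max(abs(first), 1e-12)
    change = (last - first) / scale
    if change > rel_tol:
        return "increasing"
    if change < -rel_tol:
        return "decreasing"
    return "stable"


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute gap between forecast and observed values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy(labels_true: Sequence[str], labels_pred: Sequence[str]) -> float:
    """Fraction of grid cells whose predicted trend label is correct."""
    if len(labels_true) != len(labels_pred) or len(labels_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(t == p for t, p in zip(labels_true, labels_pred))
    return hits / len(labels_true)
```

In this sketch, a model's zero-shot answer for a grid cell would be parsed into either a number (scored with `mean_absolute_error` against the held-out future value) or a trend label (scored with `accuracy` against the label derived from the observed series).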
