We present Urban-ImageNet, a large-scale multi-modal dataset and evaluation benchmark for urban space perception from user-generated social media imagery. The corpus contains over 2 million public social media images with paired textual posts, collected from Weibo at 61 urban sites in 24 Chinese cities between 2019 and 2025. It provides controlled benchmark subsets at 1K, 10K, and 100K scale, plus the full 2M corpus for large-scale training and evaluation. Urban-ImageNet is organized by HUSIC, a Hierarchical Urban Space Image Classification framework that defines a 10-class taxonomy grounded in urban theory. The taxonomy distinguishes activated and non-activated public spaces, exterior and interior urban environments, accommodation spaces, consumption content, portraits, and non-spatial social-media content. Rather than treating urban imagery as generic scene data, Urban-ImageNet evaluates whether machine perception models can capture the spatial, social, and functional distinctions that are central to urban studies. The benchmark supports three tasks within one standardized library: (T1) urban scene semantic classification, (T2) cross-modal image-text retrieval, and (T3) instance segmentation. Our experiments evaluate representative vision, vision-language, and segmentation models and show strong performance on supervised scene classification but greater difficulty with cross-modal retrieval and instance-level urban object segmentation. A multi-scale study further examines how model performance changes as the balanced training set grows from 1K to 10K to 100K images. Urban-ImageNet provides a unified, theory-grounded, multi-city benchmark for evaluating how AI systems perceive and interpret contemporary urban spaces across modalities, scales, and task formulations. The dataset and benchmark are available at huggingface.co/datasets/Yiwei-Ou/Urban-ImageNet and github.com/yiasun/dataset-2.
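As an illustration of what a class-balanced subset at the 1K, 10K, or 100K scale could look like, the following is a minimal sketch and not the benchmark's actual sampling procedure, which the abstract does not describe. The function name `balanced_subset`, the integer label encoding, and the assumption of an equal per-class quota across the 10 HUSIC classes are all hypothetical.

```python
import random
from collections import defaultdict


def balanced_subset(labels, total, num_classes=10, seed=0):
    """Return sorted indices of a class-balanced subset of size `total`.

    Hypothetical helper: assumes `labels` is a sequence of class ids and
    that every class should contribute exactly total / num_classes items.
    """
    if total % num_classes != 0:
        raise ValueError("total must be divisible by num_classes")
    per_class = total // num_classes

    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    if len(by_class) != num_classes:
        raise ValueError("labels do not cover exactly num_classes classes")

    # Draw the same number of samples from every class, reproducibly.
    rng = random.Random(seed)
    chosen = []
    for label in sorted(by_class):
        pool = by_class[label]
        if len(pool) < per_class:
            raise ValueError(f"class {label!r} has fewer than {per_class} samples")
        chosen.extend(rng.sample(pool, per_class))
    return sorted(chosen)
```

Under these assumptions, a 1K subset draws 100 images per class, a 10K subset 1,000, and a 100K subset 10,000. Fixing the seed keeps the selected indices reproducible across runs.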