Existing visual place recognition (VPR) datasets predominantly rely on vehicle-mounted imagery, offer limited multimodal diversity, and underrepresent dense pedestrian street scenes, particularly in non-Western urban contexts. We introduce MMS-VPR, a large-scale multimodal dataset for street-level place recognition in pedestrian-only environments. MMS-VPR comprises 110,529 images and 2,527 video clips across 208 locations in a ~70,800 m$^2$ open-air commercial district in Chengdu, China. Field data were collected in 2024, while social media data span seven years (2019–2025), providing both fine-grained temporal granularity and long-term temporal coverage. Each location features comprehensive day-night coverage, multiple viewing angles, and multimodal annotations including GPS coordinates, timestamps, and semantic textual metadata. We further release MMS-VPRlib, a unified benchmarking platform that consolidates commonly used VPR datasets and state-of-the-art methods under a standardized, reproducible pipeline. MMS-VPRlib provides modular components for data pre-processing, multimodal modeling (CNN/RNN/Transformer), signal enhancement, alignment, fusion, and performance evaluation. This platform moves beyond traditional image-only paradigms, enabling systematic exploitation of complementary visual, video, and textual modalities. The dataset is available at https://huggingface.co/datasets/Yiwei-Ou/MMS-VPR and the benchmark at https://github.com/yiasun/MMS-VPRlib.