MapVerse: A Benchmark for Geospatial Question Answering on Diverse Real-World Maps

Maps are powerful carriers of structured and contextual knowledge, encompassing geography, demographics, infrastructure, and environmental patterns. Reasoning over such knowledge requires models to integrate spatial relationships, visual cues, real-world context, and domain-specific expertise, capabilities that current large language models (LLMs) and vision-language models (VLMs) still struggle to exhibit consistently. Yet datasets used to benchmark VLMs on map-based reasoning remain narrow in scope, restricted to specific domains, and heavily reliant on artificially generated content (outputs from LLMs or pipeline-based methods), offering limited depth for evaluating genuine geospatial reasoning. To address this gap, we present MapVerse, a large-scale benchmark built on real-world maps. It comprises 11,837 human-authored question-answer pairs across 1,025 maps, spanning ten diverse map categories with multiple question categories for each. The dataset provides a rich setting for evaluating map reading, interpretation, and multimodal reasoning. We evaluate ten state-of-the-art models on our benchmark to establish baselines and quantify reasoning gaps. Beyond overall performance, we conduct fine-grained categorical analyses to assess model inference across multiple dimensions and investigate the visual factors that shape reasoning outcomes. Our findings reveal that while current VLMs perform competitively on classification-style tasks, both open- and closed-source models fall short on advanced tasks requiring complex spatial reasoning.
