Vision--language models (VLMs) achieve strong performance on many multimodal benchmarks but remain brittle on spatial reasoning tasks that require aligning abstract overhead representations with egocentric views. We introduce m2sv, a scalable benchmark for map-to-street-view spatial reasoning that asks models to infer camera viewing direction by aligning a north-up overhead map with a Street View image captured at the same real-world intersection. We release m2sv-20k, a geographically diverse benchmark with controlled ambiguity, along with m2sv-sft-11k, a curated set of structured reasoning traces for supervised fine-tuning. Despite strong performance on existing multimodal benchmarks, the best evaluated VLM achieves only 65.2% accuracy on m2sv, below human annotators who reach 72.0% on average (and 95% for an expert) with strong inter-annotator agreement ($\kappa$ up to 0.76). While supervised fine-tuning and reinforcement learning yield consistent gains, cross-benchmark evaluations reveal limited transfer. Beyond aggregate accuracy, we systematically analyze difficulty in map-to-street-view reasoning using both structural signals and human effort, and conduct an extensive failure analysis of adapted open models. Our findings highlight persistent gaps in geometric alignment, evidence aggregation, and reasoning consistency, motivating future work on grounded spatial reasoning across viewpoints.