VertiCue-Bench: Diagnosing Whether MLLMs Use Height Cues to Resolve 2D Ambiguity in Remote Sensing Natural Scenes

Multimodal Large Language Models (MLLMs) have recently shown promising progress in geospatial reasoning. However, existing remote sensing benchmarks remain largely 2D-centric, evaluating models primarily on optical appearance. In natural environments, this paradigm breaks down due to severe spectral confusion, where ecologically distinct regions share similar textures but differ fundamentally in vertical structure. In such cases, explicit 3D structural data, such as Canopy Height Models (CHMs), become essential geometric evidence for semantic disambiguation. Yet, it remains unclear whether current MLLMs can genuinely leverage vertical cues to resolve appearance-level ambiguity. To address this gap, we introduce VertiCue-Bench, the first diagnostic benchmark for CHM-grounded geospatial reasoning. VertiCue-Bench comprises 1,534 carefully curated instances across 17 tasks, explicitly disentangling low-level height perception from ambiguity-aware semantic reasoning. Evaluations on 14 state-of-the-art general and remote-sensing-specialized MLLMs, combined with counterfactual modality testing, reveal a striking perception-reasoning dissociation. While models exhibit emerging competence in reading raw CHM height cues, they largely fail to translate geometric perception into reliable semantic reasoning, often underperforming RGB-only baselines when joint constraints are required. Overall, VertiCue-Bench exposes a critical geometry-to-semantics gap in natural scene understanding, offering actionable insights for advancing geospatial MLLMs.
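As a rough illustration of the counterfactual modality testing mentioned above, the sketch below compares accuracy with and without the CHM input, separately for perception and reasoning tasks. It is a minimal sketch under stated assumptions, not the benchmark's actual evaluation code: the record fields (`group`, `condition`, `correct`), the condition labels (`"rgb"`, `"rgb+chm"`), and the function names are all hypothetical, and the toy records are invented solely to exercise the code.

```python
from collections import defaultdict


def accuracy_by_group(records, condition):
    """Return mean accuracy per task group for a single input condition."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        if r["condition"] != condition:
            continue
        totals[r["group"]] += 1
        hits[r["group"]] += int(r["correct"])
    return {g: hits[g] / totals[g] for g in totals}


def modality_gap(records, with_chm="rgb+chm", without_chm="rgb"):
    """Accuracy with CHM minus accuracy without it, per task group.

    A negative value means adding the CHM hurt performance relative
    to the RGB-only baseline for that group.
    """
    with_acc = accuracy_by_group(records, with_chm)
    base_acc = accuracy_by_group(records, without_chm)
    return {g: with_acc[g] - base_acc[g] for g in with_acc if g in base_acc}


# Invented toy records (NOT results from the paper), one per group/condition.
toy_records = [
    {"group": "perception", "condition": "rgb", "correct": False},
    {"group": "perception", "condition": "rgb+chm", "correct": True},
    {"group": "reasoning", "condition": "rgb", "correct": True},
    {"group": "reasoning", "condition": "rgb+chm", "correct": False},
]

gaps = modality_gap(toy_records)
```

Under this framing, a perception-reasoning dissociation would appear as a positive gap for the perception group alongside a near-zero or negative gap for the reasoning group.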
