Reasoning with Pixel-level Precision: QVLM Architecture and SQuID Dataset for Quantitative Geospatial Analytics

From arXiv; submitted to CVPR 2026. Introduces the QVLM architecture and the SQuID dataset for quantitative geospatial reasoning. Dataset DOI: 10.57967/hf/7565

Current Vision-Language Models (VLMs) fail at quantitative spatial reasoning because their architectures destroy the pixel-level information required for counting and measurement. Vision encoders compress images into patch embeddings, which coarsens spatial indexing and discards the precise pixel-level tracking that accurate counting depends on. We present two contributions to address this fundamental limitation. First, we introduce SQuID (Satellite Quantitative Intelligence Dataset), a benchmark of 2,000 satellite image question-answer pairs with both numerical-range and categorical answers, designed to evaluate quantitative spatial reasoning. The dataset spans three difficulty tiers, with annotations automatically generated from human labels and their learned variability. Second, we propose QVLM (Quantitative Vision-Language Model), a code-generation architecture that maintains pixel precision by decoupling language understanding from visual analysis. Instead of encoding images into embeddings, QVLM generates executable code that first calls a segmentation model to obtain pixel-level masks and then operates directly on those masks, preserving spatial indexing throughout the reasoning process. In our experiments, QVLM with GPT-5 as the coder achieves 42.0% accuracy on SQuID, compared to 28.1% for a VLM prompted with image-question pairs. These results indicate that, for quantitative spatial reasoning, architectural decoupling yields higher accuracy than prompting a VLM directly with the image.
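The abstract does not show what the generated code looks like, so the following is only a minimal sketch of the kind of mask-level program such a pipeline might emit, under stated assumptions. The segmentation model is replaced by a hand-written toy mask, and the names count_components, mask_area_m2, in_range, and the ground sample distance gsd_m are all hypothetical, not taken from the paper. The point it illustrates is that counting and area measurement become exact operations once the reasoning runs on a pixel mask rather than on patch embeddings.

```python
from collections import deque

# Toy binary mask standing in for the output of a segmentation model
# (assumption: 1 = pixel belongs to the target class, 0 = background).
TOY_MASK = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0],
]


def count_components(mask):
    """Count 4-connected regions of 1-pixels using breadth-first search."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Found an unvisited foreground pixel: start a new region.
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count


def mask_area_m2(mask, gsd_m):
    """Area covered by the mask, given a ground sample distance in metres per pixel."""
    pixel_count = sum(sum(row) for row in mask)
    return pixel_count * gsd_m ** 2


def in_range(value, low, high):
    """Check a prediction against a numerical-range answer (inclusive bounds)."""
    return low <= value <= high


if __name__ == "__main__":
    n_objects = count_components(TOY_MASK)
    area = mask_area_m2(TOY_MASK, gsd_m=0.5)  # hypothetical 0.5 m/pixel imagery
    print(n_objects, area, in_range(n_objects, 2, 4))
```

The in_range helper reflects the abstract's statement that SQuID uses numerical-range answers; the inclusive bounds are an assumption here. A real pipeline would more likely rely on an array library for connected-component labelling, and pure Python is used only to keep the sketch self-contained.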
