Vision-language models (VLMs) can answer spatial relation queries, yet a correct answer does not reveal whether the model truly uses directional evidence or merely exploits object layout. We present CREG (Compass Relational Evidence Graph), a training-free diagnostic framework that converts any token-level attribution map into a reference-parameterized compass distribution and evaluates it with Direction Alignment Error (DAE) and Edge Accuracy (EA). Across three VLMs and two primary benchmarks with native boxes (COCO-Pairs and VG-Spatial), plus supplementary VSR, CREG enables direct comparison of heterogeneous attribution methods on a shared directional scale; Chefer et al. is usually the strongest plug-in, indicating that the framework is not tied to our contrastive Grad-Act signal. Using CREG to probe VLM spatial attribution, we find that attribution is largely layout-driven: changing the queried direction leaves compass outputs near random, and re-centering the projection provides no advantage for the true reference origin. At the same time, CREG detects a limited residual directional component once image identity is controlled. This residual structure is practically useful: lower DAE predicts VLM correctness (AUC up to 0.65) and supports selective prediction and test-time re-ranking, improving accuracy by 14.0 percentage points on COCO-Pairs. CREG provides a unified way to measure directional organization in VLM attribution, making layout bias and residual relational signal explicit and quantifiable.
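To make the idea of a reference-parameterized compass distribution concrete, the following is a minimal sketch and not the procedure used in CREG: the abstract does not specify the projection or the exact definition of DAE. It assumes an 8-way compass, bins non-negative attribution mass by angle around a reference point, and scores alignment as the angular gap between the circular mean of that distribution and the queried direction. The function names, the bin count, and this particular DAE formula are hypothetical choices made for illustration.

```python
import numpy as np

def compass_distribution(attr, ref_xy, n_bins=8):
    """Bin attribution mass by angle around a reference point (illustrative assumption).

    attr   : (H, W) attribution map on the token/patch grid.
    ref_xy : (x, y) reference origin in grid coordinates.
    Bin 0 is centered on east; bins proceed counter-clockwise (bin 2 = north for n_bins=8).
    """
    attr = np.clip(np.asarray(attr, dtype=float), 0.0, None)  # keep positive evidence only
    h, w = attr.shape
    ys, xs = np.mgrid[0:h, 0:w]
    dx = xs - ref_xy[0]
    dy = ref_xy[1] - ys                     # flip y so "up" in the image is positive
    angles = np.arctan2(dy, dx)             # range [-pi, pi]
    width = 2 * np.pi / n_bins
    # Shift by half a bin so each bin is centered on its compass direction.
    bins = np.floor(((angles + width / 2) % (2 * np.pi)) / width).astype(int) % n_bins
    mass = attr.copy()
    mass[(dx == 0) & (dy == 0)] = 0.0       # the origin cell has no direction
    hist = np.bincount(bins.ravel(), weights=mass.ravel(), minlength=n_bins)
    total = hist.sum()
    return hist / total if total > 0 else np.full(n_bins, 1.0 / n_bins)

def direction_alignment_error(dist, target_angle):
    """Angular gap in degrees between the circular mean of dist and the target direction.

    This is one plausible reading of a direction alignment error, assumed here.
    """
    n = len(dist)
    centers = np.arange(n) * 2 * np.pi / n
    mean_angle = np.arctan2(np.sum(dist * np.sin(centers)),
                            np.sum(dist * np.cos(centers)))
    diff = np.abs((mean_angle - target_angle + np.pi) % (2 * np.pi) - np.pi)
    return float(np.degrees(diff))

# Toy usage: all attribution sits directly east of the reference cell.
attr = np.zeros((5, 5))
attr[2, 4] = 1.0
dist = compass_distribution(attr, ref_xy=(2, 2))
err_east = direction_alignment_error(dist, target_angle=0.0)      # queried: east
err_west = direction_alignment_error(dist, target_angle=np.pi)    # queried: west
```

Under this sketch, a model whose attribution tracks the queried direction would show a low error for the true relation and a high error for the opposite one. A layout-driven model would produce similar distributions regardless of the query, which is the failure mode the abstract describes.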