Large Vision-Language Models (VLMs) have long struggled with spatial reasoning tasks. Surprisingly, even simple spatial reasoning tasks, such as recognizing "under" or "behind" relationships between only two objects, pose significant challenges for current VLMs. In this work, we study the spatial reasoning challenge through the lens of mechanistic interpretability, diving into the model's internal states to examine the interactions between image and text tokens. By tracing attention distribution over the image throughout intermediate layers, we observe that successful spatial reasoning correlates strongly with the model's ability to align its attention distribution with actual object locations, with a marked difference between familiar and unfamiliar spatial relationships. Motivated by these findings, we propose ADAPTVIS, an inference-time method that uses confidence scores to sharpen attention on highly relevant regions when the model is confident, and to smooth and broaden the attention window to consider wider context when confidence is lower. This training-free decoding method yields significant improvements (e.g., up to a 50-point absolute gain) on spatial reasoning benchmarks such as WhatsUp and VSR at negligible cost. We make code and data publicly available for research purposes at https://github.com/shiqichen17/AdaptVis.
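The core idea of confidence-gated attention adjustment can be sketched as a simple re-tempering of attention logits over image tokens. This is a minimal illustrative sketch, not the authors' implementation: the function name, the confidence threshold, and the temperature values are all assumptions chosen for clarity.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array of logits.
    e = np.exp(x - x.max())
    return e / e.sum()

def adapt_attention(attn_logits, confidence, threshold=0.5,
                    alpha_sharp=2.0, alpha_smooth=0.5):
    """Re-temper attention over image tokens based on confidence.

    When the model is confident (confidence >= threshold), multiply
    the logits by alpha_sharp > 1 to sharpen attention on the most
    relevant regions; otherwise multiply by alpha_smooth < 1 to
    flatten the distribution and consider a wider visual context.
    All parameter names and values here are hypothetical.
    """
    alpha = alpha_sharp if confidence >= threshold else alpha_smooth
    return softmax(alpha * attn_logits)
```

Scaling logits by a factor greater than one concentrates probability mass on the highest-scoring image regions, while a factor below one spreads it out; the gating simply picks between the two regimes using an inference-time confidence score, so no training is required.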