To what degree, and under what conditions, do VLMs rely on scene context when generating references to objects? To address this question, we introduce the $\textit{Common Objects Out-of-Context (COOCo)}$ dataset and conduct experiments on several VLMs under varying degrees of scene-object congruency and noise. We find that models leverage scene context adaptively, depending on scene-object semantic relatedness and noise level. Given these consistent trends across models, we then ask how VLM attention patterns change as a function of target-scene semantic fit, and to what degree these patterns predict categorisation accuracy. We find that successful object categorisation is associated with increased mid-layer attention to the target. We also find a non-monotonic dependency on semantic fit, with attention dropping at moderate fit and increasing for both low and high fit. These results suggest that VLMs dynamically balance local and contextual information for reference generation. Dataset and code are available at $\href{https://github.com/cs-nlp-uu/scenereg}{https://github.com/cs-nlp-uu/scenereg}$.