In most Vision-Language (VL) models, understanding of image structure is enabled by injecting position information (PI) about the objects in the image. In a case study of LXMERT, a state-of-the-art VL model, we probe how PI is used in the representation and study its effect on Visual Question Answering. We show that the model cannot leverage PI for the image-text matching task on a challenge set in which only the position differs, even though our probing experiments confirm that PI is present in the representation. We introduce two strategies to address this: (i) Positional Information Pre-training and (ii) Contrastive Learning on PI using Cross-Modality Matching. With these, the model can correctly classify whether images and statements containing detailed PI match. In addition to the 2D information from bounding boxes, we introduce each object's depth as a new feature for better localization of objects in space. Although we were able to improve the model's properties as measured by our probes, this has only a negligible effect on downstream performance. Our results thus highlight an important issue in multimodal modeling: the mere presence of information detectable by a probing classifier does not guarantee that the information is available in a cross-modal setup.
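As a rough illustration of what extending 2D bounding-box PI with depth could look like, the sketch below builds a five-value position vector per object. The abstract does not specify how depth is encoded, so the function name `position_features`, the `max_depth` normalizer, and the clamping to [0, 1] are all assumptions made for illustration and not the authors' implementation.

```python
from typing import List, Sequence


def position_features(
    box: Sequence[float],
    img_w: float,
    img_h: float,
    depth: float,
    max_depth: float,
) -> List[float]:
    """Return [x1, y1, x2, y2, d], each scaled to the range [0, 1].

    Hypothetical helper: `box` is (x1, y1, x2, y2) in pixels and `depth`
    is an estimated object depth in the same unit as `max_depth`.
    """
    if img_w <= 0 or img_h <= 0 or max_depth <= 0:
        raise ValueError("img_w, img_h and max_depth must be positive")
    x1, y1, x2, y2 = box
    # Normalize box corners by image size and depth by an assumed maximum.
    feats = [x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h, depth / max_depth]
    # Clamp so out-of-range boxes or depth estimates stay in [0, 1].
    return [min(max(f, 0.0), 1.0) for f in feats]


if __name__ == "__main__":
    print(position_features((50, 100, 150, 200), 200, 400, 2.5, 10.0))
```

Under these assumptions, the resulting vector would be passed to the model in place of the usual four normalized box coordinates, which would require the position-embedding layer to accept five inputs instead of four.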