The dominant probing approaches rely on zero-shot performance on image-text matching tasks to gain a finer-grained understanding of the representations learned by recent multimodal image-language transformer models. The evaluation is carried out on carefully curated datasets focusing on counting, relations, attributes, and other phenomena. This work introduces an alternative probing strategy called guided masking. The proposed approach ablates different modalities using masking and assesses the model's ability to predict the masked word with high accuracy. We focus on multimodal models that take region-of-interest (ROI) features obtained by object detectors as input tokens. We probe the understanding of verbs using guided masking on ViLBERT, LXMERT, UNITER, and VisualBERT and show that these models can predict the correct verb with high accuracy. This contrasts with previous conclusions drawn from image-text matching probing techniques, in which these models frequently fail in situations requiring verb understanding. The code for all experiments will be publicly available at https://github.com/ivana-13/guided_masking.
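As a rough illustration of the probing idea described above, the following minimal sketch masks the verb in a caption, optionally zeroes out ROI features to ablate the visual modality, and measures how often the masked word appears among the top-k predictions. It is not the authors' implementation: the function names (`mask_word`, `ablate_regions`, `top_k_accuracy`), the example format, and the `toy_predict` stand-in for a real model such as ViLBERT or LXMERT are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Sequence, Set

MASK = "[MASK]"  # assumed mask token; real tokenizers define their own


def mask_word(tokens: Sequence[str], index: int) -> List[str]:
    """Return a copy of `tokens` with the word at `index` replaced by MASK."""
    masked = list(tokens)
    masked[index] = MASK
    return masked


def ablate_regions(regions: Sequence[Sequence[float]], drop: Set[int]) -> List[List[float]]:
    """Zero out the ROI feature vectors whose positions are listed in `drop`."""
    return [[0.0] * len(r) if i in drop else list(r) for i, r in enumerate(regions)]


def top_k_accuracy(
    examples: Sequence[Dict],
    predict: Callable[[List[str], Sequence[Sequence[float]]], List[str]],
    k: int = 1,
) -> float:
    """Fraction of examples whose masked verb is among the top-k predictions."""
    hits = 0
    for ex in examples:
        target = ex["tokens"][ex["verb_index"]]
        masked = mask_word(ex["tokens"], ex["verb_index"])
        ranked = predict(masked, ex["regions"])  # candidates, best first
        if target in ranked[:k]:
            hits += 1
    return hits / len(examples)


def toy_predict(masked_tokens: List[str], regions: Sequence[Sequence[float]]) -> List[str]:
    """Hypothetical stand-in for a multimodal model: it only ranks the
    correct verb first when some visual evidence is present."""
    has_visual = any(any(v != 0.0 for v in r) for r in regions)
    return ["riding", "eating"] if has_visual else ["eating", "riding"]


# One toy example: a caption, the position of its verb, and one ROI feature vector.
example = {
    "tokens": ["a", "man", "riding", "a", "horse"],
    "verb_index": 2,
    "regions": [[1.0, 0.5]],
}

# Same example with the visual modality ablated.
blind = dict(example, regions=ablate_regions(example["regions"], {0}))

with_image = top_k_accuracy([example], toy_predict, k=1)
without_image = top_k_accuracy([blind], toy_predict, k=1)
```

Comparing `with_image` and `without_image` indicates how much the prediction depends on the visual input, which is the kind of question that modality ablation is meant to answer. A real experiment would replace `toy_predict` with the masked-language-modeling head of the model under study.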