We propose a methodology and design two benchmark sets for measuring the extent to which multimodal (language-and-vision) language models use the visual signal in the presence or absence of stereotypes. The first benchmark tests for stereotypical colors of common objects, while the second considers gender stereotypes. The key idea is to compare a model's predictions when the image conforms to the stereotype with its predictions when the image does not. Our results show significant variation among multimodal models: the recent Transformer-based FLAVA seems to be more sensitive to the choice of image and less affected by stereotypes than older CNN-based models such as VisualBERT and LXMERT. This effect is easier to discern in this type of controlled setting than in traditional evaluations, where we do not know whether the model relied on the stereotype or on the visual signal.
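The comparison described in the abstract can be illustrated with a minimal sketch. It assumes each test item is a pair of predictions for the same object, one made on an image that conforms to the stereotype and one made on an image that contradicts it. The names `PairedExample` and `summarize`, the three summary quantities, and the toy predictions are all hypothetical and chosen for illustration; they are not the benchmark's actual data or metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairedExample:
    """One object shown in two conditions (hypothetical structure)."""
    stereotype_label: str   # stereotypical answer, e.g. "yellow" for a banana
    counter_label: str      # what the counter-stereotypical image actually shows
    pred_conforming: str    # model prediction on the stereotype-conforming image
    pred_counter: str       # model prediction on the counter-stereotypical image


def summarize(examples):
    """Compare predictions across the two image conditions."""
    n = len(examples)
    if n == 0:
        raise ValueError("need at least one paired example")
    # Accuracy when the image agrees with the stereotype.
    acc_conforming = sum(e.pred_conforming == e.stereotype_label for e in examples) / n
    # Accuracy when the image contradicts the stereotype.
    acc_counter = sum(e.pred_counter == e.counter_label for e in examples) / n
    # How often the model answers with the stereotype despite the image.
    stereotype_fallback = sum(e.pred_counter == e.stereotype_label for e in examples) / n
    return {
        "acc_conforming": acc_conforming,
        "acc_counter": acc_counter,
        "accuracy_gap": acc_conforming - acc_counter,
        "stereotype_fallback": stereotype_fallback,
    }


# Toy predictions, invented for illustration only.
examples = [
    PairedExample("yellow", "green", "yellow", "green"),
    PairedExample("red", "blue", "red", "red"),
    PairedExample("green", "purple", "green", "purple"),
    PairedExample("white", "pink", "black", "white"),
]

summary = summarize(examples)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions, a model that relies on the visual signal would show a small accuracy gap and a low stereotype-fallback rate, while a model that leans on the stereotype would score well only in the conforming condition.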