With the adoption of the Transformer architecture, large Vision and Language (V&L) models have shown promising performance even in zero-shot settings. Several studies, however, indicate that these models lack robustness when dealing with complex linguistic and visual attributes. In this work, we introduce ColorFoil, a novel V&L benchmark that creates color-related foils to assess the models' ability to perceive basic colors such as red, white, and green. We evaluate seven state-of-the-art V&L models, including CLIP, ViLT, GroupViT, and BridgeTower, in a zero-shot setting and report intriguing findings. The experimental evaluation indicates that ViLT and BridgeTower demonstrate considerably better color perception than CLIP, its variants, and GroupViT. Moreover, the CLIP-based models and GroupViT struggle to distinguish colors that are visually distinct to humans with normal color perception.
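To illustrate the kind of evaluation described above, the following is a minimal sketch of a single color-foil trial with CLIP in a zero-shot setting. It assumes the Hugging Face transformers library, a local image file (car.jpg), and a simple word-swap foil; the caption, the swapped color, and the file name are illustrative placeholders, and the actual ColorFoil construction and scoring protocol may differ.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a publicly available CLIP checkpoint (illustrative choice).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Original caption and its color foil: only the color word is swapped.
caption = "a red car parked on the street"   # hypothetical caption
foil = caption.replace("red", "green")        # color-related foil

image = Image.open("car.jpg")                 # hypothetical image file

# Score the image against both texts; a model with good color perception
# should assign a higher image-text similarity to the original caption.
inputs = processor(text=[caption, foil], images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape: (1, 2)

correct = logits.argmax(dim=-1).item() == 0
print("caption preferred over foil:", correct)
```

Accuracy over many such caption/foil pairs would then quantify how reliably a model detects the correct color, which is the style of comparison reported for ViLT, BridgeTower, CLIP, and GroupViT above.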