We present a novel framework for probing and improving the relational, compositional, and contextual understanding of large visual-language (V+L) models. While large V+L models have achieved success in various downstream tasks, it is not clear whether they have a conceptual grasp of the content. We propose a novel benchmarking dataset for probing these three aspects of content understanding. Our probes are grounded in cognitive science and help determine whether a V+L model can, for example, recognize that snow garnished with a man is implausible, or identify beach furniture by knowing that it is located on a beach. We experimented with five well-known models, such as CLIP and ViLT, and found that they mostly fail to demonstrate conceptual understanding. That said, our experiments yield useful insights, for example that cross-attention helps models learn conceptual understanding. We use these insights to propose a new finetuning technique that rewards the three conceptual understanding measures we propose. We hope that the presented benchmarks will help the community assess and improve the conceptual understanding capabilities of large V+L models.