3VL: using Trees to teach Vision & Language models compositional concepts

Vision-Language models (VLMs) have proved effective at aligning image and text representations, producing superior zero-shot results when transferred to many downstream tasks. However, these representations suffer from key shortcomings in understanding Compositional Language Concepts (CLC), such as recognizing objects' attributes, states, and relations between different objects. Moreover, VLMs typically have poor interpretability, making it challenging to debug and mitigate compositional-understanding failures. In this work, we introduce the Tree-augmented Vision-Language (3VL) model architecture and training technique, accompanied by our proposed Anchor inference method and Differential Relevance (DiRe) interpretability tool. By expanding the text of an arbitrary image-text pair into a hierarchical tree structure using language analysis tools, 3VL allows us to induce this structure into the visual representation learned by the model, enhancing its interpretability and compositional reasoning. Additionally, we show how Anchor, a simple technique for text unification, can be employed to filter nuisance factors while increasing CLC understanding performance, e.g., on the fundamental VL-Checklist benchmark. We also show how DiRe, which performs a differential comparison between VLM relevancy maps, enables us to generate compelling visualizations of the reasons for a model's success or failure.
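To make the idea of expanding a caption into a hierarchical tree concrete, the following is a minimal sketch and not the authors' implementation. It assumes the caption has already been split into phrase chunks by some language analysis tool, and simply places progressively longer prefixes of the caption at deeper tree levels. The names `Node`, `caption_to_tree`, and `levels`, as well as the chunking itself, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One tree level: a partial caption plus its more specific children."""
    text: str
    children: List["Node"] = field(default_factory=list)


def caption_to_tree(chunks: List[str]) -> Optional[Node]:
    """Build a chain-shaped tree whose depth-k node holds the first k chunks.

    Assumption: `chunks` comes from an external parser (for example, noun
    phrases and prepositional phrases in caption order). Returns None for
    an empty chunk list.
    """
    root: Optional[Node] = None
    parent: Optional[Node] = None
    for k in range(1, len(chunks) + 1):
        node = Node(" ".join(chunks[:k]))
        if parent is None:
            root = node          # the coarsest description becomes the root
        else:
            parent.children.append(node)
        parent = node
    return root


def levels(root: Optional[Node]) -> List[str]:
    """Collect the text at each depth, from coarse to fine."""
    out: List[str] = []
    node = root
    while node is not None:
        out.append(node.text)
        node = node.children[0] if node.children else None
    return out


if __name__ == "__main__":
    tree = caption_to_tree(["a dog", "on a red couch", "next to a lamp"])
    for depth, text in enumerate(levels(tree)):
        print(depth, text)
```

Each level of such a tree describes the same image at a different granularity, which is the kind of structure the abstract refers to. How 3VL actually builds its trees and uses them during training is not specified in this text.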

