Vision-Language models (VLMs) have proved effective at aligning image and text representations, producing superior zero-shot results when transferred to many downstream tasks. However, these representations suffer from key shortcomings in Compositional Language Concepts (CLC) understanding, such as recognizing objects' attributes, states, and relations between different objects. Moreover, VLMs typically have poor interpretability, making it challenging to debug and mitigate compositional-understanding failures. In this work, we introduce the Tree-augmented Vision-Language (3VL) model architecture and training technique, accompanied by our proposed Anchor inference method and Differential Relevance (DiRe) interpretability tool. By expanding the text of an arbitrary image-text pair into a hierarchical tree structure using language analysis tools, 3VL makes it possible to induce this structure into the visual representation learned by the model, enhancing its interpretability and compositional reasoning. Additionally, we show how Anchor, a simple technique for text unification, can be used to filter nuisance factors while improving CLC understanding performance, e.g., on the fundamental VL-Checklist benchmark. We also show how DiRe, which performs a differential comparison between VLM relevancy maps, enables us to generate compelling visualizations of the reasons for a model's success or failure.
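To make the idea of a differential comparison between relevancy maps more concrete, the following is a minimal sketch only. The abstract does not specify how DiRe combines the maps, so everything here is an assumption for illustration: the function names `normalize` and `differential_relevance` are hypothetical, the min-max normalization is an arbitrary choice, and the rule "subtract and keep the positive part" is a stand-in, not the authors' formulation. Relevancy maps are represented as plain nested lists of floats.

```python
def normalize(rel_map):
    """Min-max scale a 2D relevancy map to [0, 1] (assumed preprocessing)."""
    flat = [v for row in rel_map for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A constant map carries no spatial information; return all zeros.
        return [[0.0 for _ in row] for row in rel_map]
    return [[(v - lo) / (hi - lo) for v in row] for row in rel_map]


def differential_relevance(map_a, map_b):
    """Hypothetical differential map: regions more relevant to text A than to text B.

    Both inputs are normalized, subtracted element-wise, and negative
    values are clipped to zero. This is an illustrative assumption,
    not the DiRe definition from the paper.
    """
    norm_a, norm_b = normalize(map_a), normalize(map_b)
    return [
        [max(a - b, 0.0) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(norm_a, norm_b)
    ]


# Toy 2x2 relevancy maps for a correct caption and a perturbed caption.
relevance_correct = [[0.0, 1.0], [2.0, 4.0]]
relevance_perturbed = [[0.0, 4.0], [2.0, 1.0]]
diff_map = differential_relevance(relevance_correct, relevance_perturbed)
```

In this toy case only the bottom-right cell stays non-zero, which is the kind of localized contrast a differential visualization is meant to surface; a real pipeline would obtain the input maps from a VLM explainability method rather than hand-written lists.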