In this paper, we explore the potential of Vision-Language Models (VLMs), specifically CLIP, in predicting visual object relationships, a task that requires translating visual features from images into language-based relations. Current state-of-the-art methods address this challenge with complex graphical models that combine language cues and visual features. We hypothesize that the strong language priors in CLIP embeddings can simplify these graphical models, paving the way for a simpler approach. We adopt the UVTransE relation prediction framework, which learns the relation as a translational embedding over the subject, object, and union-box embeddings from a scene. We systematically explore the design of CLIP-based subject, object, and union-box representations within the UVTransE framework and propose CREPE (CLIP Representation Enhanced Predicate Estimation). CREPE utilizes text-based representations for all three bounding boxes and introduces a novel contrastive training strategy to automatically infer the text prompt for the union box. Our approach achieves state-of-the-art performance in predicate estimation on the Visual Genome benchmark, with mR@5 of 27.79 and mR@20 of 31.95, a 15.3% gain over the recent state of the art at mR@20. This work demonstrates CLIP's effectiveness in object relation prediction and encourages further research on VLMs in this challenging domain.
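To make the translational-embedding idea concrete, the following is a minimal sketch, not the CREPE implementation. It assumes the commonly cited UVTransE form in which the predicate embedding is approximated as the union-box embedding minus the subject and object embeddings (p ≈ u − s − o). Random vectors stand in for CLIP embeddings, and the function names `predicate_embedding` and `classify_predicate`, as well as the nearest-prototype classifier, are hypothetical illustrations rather than anything specified in the abstract.

```python
import numpy as np


def predicate_embedding(subj, obj, union):
    """Translational relation embedding, assumed UVTransE form: p = u - s - o."""
    return union - subj - obj


def classify_predicate(p, prototypes):
    """Return the index of the prototype with the highest cosine similarity to p.

    A nearest-prototype rule is a stand-in here; a trained model would
    typically use a learned classifier on top of p.
    """
    p_unit = p / np.linalg.norm(p)
    proto_unit = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return int(np.argmax(proto_unit @ p_unit))


# Random placeholders for subject/object embeddings (not real CLIP features).
rng = np.random.default_rng(0)
dim = 8
subj = rng.normal(size=dim)
obj = rng.normal(size=dim)

# Hypothetical predicate prototypes, e.g. for "on", "holding", "riding".
prototypes = rng.normal(size=(3, dim))

# Build a union-box embedding whose relation component is prototype 1,
# so the translational model should recover that predicate.
union = subj + obj + prototypes[1]

p = predicate_embedding(subj, obj, union)
predicted = classify_predicate(p, prototypes)
```

In this toy setup the union embedding is constructed so that subtracting the subject and object embeddings leaves exactly one prototype, which is why the nearest-prototype lookup recovers it. With real CLIP features the three embeddings would come from the image or text encoders and the mapping would be learned.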