Interpretable Zero-shot Referring Expression Comprehension with Query-driven Scene Graphs

Zero-shot referring expression comprehension (REC) aims to locate target objects in images given natural language queries without relying on task-specific training data, which demands strong visual understanding. Existing Vision-Language Models~(VLMs), such as CLIP, commonly address zero-shot REC by directly measuring feature similarity between textual queries and image regions. However, these methods struggle to capture fine-grained visual details and to understand complex object relationships. Meanwhile, although Large Language Models~(LLMs) excel at high-level semantic reasoning, their inability to directly abstract visual features into textual semantics limits their application to REC. To overcome these limitations, we propose \textbf{SGREC}, an interpretable zero-shot REC method that uses query-driven scene graphs as structured intermediaries. Specifically, we first employ a VLM to construct a query-driven scene graph that explicitly encodes the spatial relationships, descriptive captions, and object interactions relevant to the given query. This scene graph bridges the gap between low-level image regions and the higher-level semantic understanding required by LLMs. Finally, an LLM infers the target object from the structured textual representation provided by the scene graph and explains its decision in detail, which makes the inference process interpretable. Extensive experiments show that SGREC achieves the highest accuracy on most zero-shot REC benchmarks, including RefCOCO val (66.78\%), RefCOCO+ testB (53.43\%), and RefCOCOg val (73.28\%), highlighting its strong visual scene understanding.
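To make the role of the scene graph as a structured textual intermediary concrete, the following is a minimal sketch of how such a graph might be serialized into an LLM prompt. The abstract does not specify the graph schema or the prompt wording, so the class names (`SceneObject`, `Relation`, `SceneGraph`), the `build_prompt` helper, and the text layout are all hypothetical; the VLM that would populate the graph and the LLM call itself are omitted.

```python
from __future__ import annotations

from dataclasses import dataclass, field


# Hypothetical schema: the abstract only says the graph encodes spatial
# relationships, descriptive captions, and object interactions.
@dataclass
class SceneObject:
    obj_id: int
    label: str
    box: tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
    caption: str = ""


@dataclass
class Relation:
    subject_id: int
    predicate: str  # spatial ("left of") or interaction ("holding")
    object_id: int


@dataclass
class SceneGraph:
    objects: list[SceneObject] = field(default_factory=list)
    relations: list[Relation] = field(default_factory=list)

    def to_text(self) -> str:
        """Serialize the graph into plain text that an LLM can read."""
        lines = ["Objects:"]
        for o in self.objects:
            x1, y1, x2, y2 = o.box
            lines.append(
                f"[{o.obj_id}] {o.label} at "
                f"({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f}): {o.caption}"
            )
        names = {o.obj_id: o.label for o in self.objects}
        lines.append("Relations:")
        for r in self.relations:
            lines.append(
                f"[{r.subject_id}] {names[r.subject_id]} {r.predicate} "
                f"[{r.object_id}] {names[r.object_id]}"
            )
        return "\n".join(lines)


def build_prompt(graph: SceneGraph, query: str) -> str:
    """Assumed prompt layout: graph text, then the query, then the task."""
    return (
        f"{graph.to_text()}\n"
        f"Query: {query}\n"
        "Answer with the id of the object the query refers to, "
        "then explain your reasoning."
    )


# Toy graph standing in for what a VLM would produce for this query.
graph = SceneGraph(
    objects=[
        SceneObject(1, "person", (10, 20, 110, 220), "a man in a red jacket"),
        SceneObject(2, "umbrella", (60, 0, 160, 80), "a black umbrella"),
        SceneObject(3, "person", (200, 30, 290, 230), "a woman with a backpack"),
    ],
    relations=[
        Relation(1, "holding", 2),
        Relation(1, "left of", 3),
    ],
)
prompt = build_prompt(graph, "the man holding the umbrella")
print(prompt)
```

Because every object carries an explicit id, the LLM's answer can be mapped back to a bounding box, and its accompanying explanation can cite specific nodes and relations, which is one plausible way the interpretability described above could be realized.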
