Large-scale, pre-trained neural networks have demonstrated strong capabilities in various tasks, including zero-shot image segmentation. To identify concrete objects in complex scenes, humans instinctively rely on deictic descriptions in natural language, i.e., context-dependent references such as "the object that is on the desk and behind the cup". However, deep learning approaches cannot reliably interpret such deictic representations due to their lack of reasoning capabilities in complex scenarios. To remedy this issue, we propose DeiSAM -- a combination of large pre-trained neural networks with differentiable logic reasoners -- for deictic promptable segmentation. Given a complex, textual segmentation description, DeiSAM leverages Large Language Models (LLMs) to generate first-order logic rules and performs differentiable forward reasoning on generated scene graphs. Subsequently, DeiSAM segments objects by matching them to the logically inferred image regions. As part of our evaluation, we propose the Deictic Visual Genome (DeiVG) dataset, containing paired visual inputs and complex, deictic textual prompts. Our empirical results demonstrate that DeiSAM substantially improves over purely data-driven baselines for deictic promptable segmentation.
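To make the pipeline concrete, the following toy sketch (not the DeiSAM implementation) shows the core reasoning step: a first-order rule, as an LLM might emit for the example prompt, evaluated over a small scene graph of (subject, relation, object) triples. The scene graph contents, rule encoding, and entity-matching step are illustrative assumptions.

```python
# Scene graph: (subject, relation, object) triples, e.g. produced by a
# scene graph generator for the image.
scene_graph = {
    ("book", "on", "desk"),
    ("book", "behind", "cup"),
    ("lamp", "on", "desk"),
    ("cup", "on", "desk"),
}

# Hand-encoded stand-in for an LLM-generated rule:
# target(X) :- on(X, desk), behind(X, cup).
def rule(x, graph):
    return (x, "on", "desk") in graph and (x, "behind", "cup") in graph

# Forward reasoning: check every entity against the rule body.
entities = {s for (s, _, _) in scene_graph} | {o for (_, _, o) in scene_graph}
targets = sorted(x for x in entities if rule(x, scene_graph))
print(targets)  # entities satisfying the deictic description -> ['book']
```

In DeiSAM this reasoning is differentiable rather than Boolean, and the inferred entities are matched to image regions for segmentation; the sketch only illustrates the rule-over-scene-graph logic.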