Conversational image segmentation grounds abstract, intent-driven concepts into pixel-accurate masks. Prior work on referring image grounding focuses on categorical and spatial queries (e.g., "left-most apple") and overlooks functional and physical reasoning (e.g., "where can I safely store the knife?"). We address this gap and introduce Conversational Image Segmentation (CIS) and ConverSeg, a benchmark spanning entities, spatial relations, intent, affordances, functions, safety, and physical reasoning. We also present ConverSeg-Net, which fuses strong segmentation priors with language understanding, and an AI-powered data engine that generates prompt-mask pairs without human supervision. We show that current language-guided segmentation models are inadequate for CIS, while ConverSeg-Net trained on our data engine achieves significant gains on ConverSeg and maintains strong performance on existing language-guided segmentation benchmarks. Project webpage: https://glab-caltech.github.io/converseg/