Segmentation Vision-Language Models (VLMs) have significantly advanced grounded visual understanding, yet they remain prone to pixel-grounding hallucinations, producing masks for incorrect objects or for objects that are entirely absent. Existing evaluations rely almost entirely on text- or label-based perturbations, which check only whether the predicted mask matches the queried label. Such evaluations overlook the spatial footprint and severity of hallucination and therefore fail to reveal vision-driven hallucinations, which are more challenging and more prevalent. To address this gap, we formalize the task of Counterfactual Segmentation Reasoning (CSR), where a model must segment the referenced object in the factual image and abstain in its counterfactual counterpart. To support this task, we curate HalluSegBench, the first large-scale benchmark to diagnose referring and reasoning expression segmentation hallucinations using controlled visual counterfactuals, alongside new evaluation metrics that measure hallucination severity and disentangle vision- and language-driven failure modes. We further introduce RobustSeg, a segmentation VLM trained with counterfactual fine-tuning (CFT) to learn when to segment and when to abstain. Experimental results confirm RobustSeg reduces hallucinations by 30%, while improving segmentation performance on FP-RefCOCO(+/g).
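To make the CSR setup concrete, the following is a minimal sketch of how one factual/counterfactual pair could be scored, assuming binary masks stored as NumPy arrays. The function names (`mask_iou`, `score_csr_pair`) and the two counterfactual quantities (`abstained`, `hallucinated_fraction`) are hypothetical illustrations. They are not the metrics defined by HalluSegBench, which the abstract does not specify.

```python
import numpy as np


def mask_iou(pred, gt):
    """Intersection-over-union of two binary masks with the same shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Both masks are empty: treat as perfect agreement.
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)


def score_csr_pair(pred_factual, gt_factual, pred_counterfactual):
    """Score one factual/counterfactual pair (illustrative only).

    Assumption: the referenced object is absent from the counterfactual
    image, so the correct counterfactual output is an empty mask.
    """
    cf = np.asarray(pred_counterfactual, dtype=bool)
    return {
        # How well the model segments the object when it is present.
        "factual_iou": mask_iou(pred_factual, gt_factual),
        # True if the model predicted no pixels on the counterfactual.
        "abstained": not bool(cf.any()),
        # Share of counterfactual image pixels covered by a spurious
        # mask: a crude stand-in for hallucination "spatial footprint".
        "hallucinated_fraction": float(cf.mean()),
    }


if __name__ == "__main__":
    gt = np.zeros((4, 4), dtype=bool)
    gt[1:3, 1:3] = True            # object occupies a 2x2 region

    pred_factual = gt.copy()       # perfect factual prediction
    pred_cf = np.zeros((4, 4), dtype=bool)
    pred_cf[0, 0] = True           # one spurious pixel on the counterfactual

    print(score_csr_pair(pred_factual, gt, pred_cf))
```

Under this sketch, a model behaves as CSR requires when `factual_iou` is high and `abstained` is true. Reporting the covered fraction in addition to the binary abstention flag reflects the abstract's point that label-matching checks alone miss the spatial extent of a hallucination.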