Current open-source Large Multimodal Models (LMMs) excel at tasks such as open-vocabulary language grounding and segmentation, but they can fail under false premises, when a query implies the existence of something that is not actually present in the image. We observe that existing methods that fine-tune an LMM to segment images significantly degrade its ability to reliably determine whether an object is present ("see") and to interact naturally with humans ("say"), a form of catastrophic forgetting. In this work, we propose a cascading and joint training approach for LMMs to solve this task while avoiding catastrophic forgetting of previous skills. Our resulting model can "see" by detecting whether objects are present in an image, "say" by telling the user if they are not and by proposing alternative queries or correcting semantic errors in the query, and finally "segment" by outputting the mask of the desired objects if they exist. Additionally, we introduce a novel False Premise Correction benchmark dataset, FP-RefCOCO(+/g), an extension of the existing RefCOCO(+/g) referring segmentation datasets. The results show that our method not only detects false premises up to 55% better than existing approaches, but also, under false premise conditions, produces relative cIOU improvements of more than 31% over baselines and natural language feedback judged helpful up to 67% of the time.
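To make the reported metric concrete, the following is a minimal sketch of how a cumulative IoU and a relative improvement could be computed. It assumes that cIOU denotes the cumulative IoU commonly used in referring segmentation (total intersection over total union across all samples); the function names `cumulative_iou` and `relative_improvement`, the list-of-rows mask format, and the value returned when the union is empty are illustrative assumptions, not details taken from this work.

```python
from typing import Sequence

# Assumption: a binary mask is given as rows of 0/1 integers.
Mask = Sequence[Sequence[int]]


def cumulative_iou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Total intersection divided by total union over all samples."""
    inter = 0
    union = 0
    for pred, target in zip(preds, targets):
        for pred_row, target_row in zip(pred, target):
            for p, t in zip(pred_row, target_row):
                inter += 1 if (p and t) else 0
                union += 1 if (p or t) else 0
    # Assumption: if nothing is predicted and nothing is labeled anywhere,
    # treat the result as a perfect score.
    return inter / union if union else 1.0


def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, e.g. 0.31 means 31%."""
    return (new - baseline) / baseline


# One true-premise sample, plus one false-premise sample whose target is empty.
true_target = [[1, 0], [0, 0]]
true_pred = [[1, 1], [0, 0]]
empty = [[0, 0], [0, 0]]
hallucinated = [[1, 1], [1, 1]]

# A model that segments a non-existent object inflates the union.
baseline = cumulative_iou([true_pred, hallucinated], [true_target, empty])
# A model that outputs no mask under the false premise avoids that penalty.
improved = cumulative_iou([true_pred, empty], [true_target, empty])

print(f"baseline cIoU: {baseline:.3f}, improved cIoU: {improved:.3f}")
print(f"relative improvement: {relative_improvement(improved, baseline):.0%}")
```

Under this definition, a false-premise sample has an empty ground-truth mask, so any predicted pixels add to the union without adding to the intersection. This is one way to see why declining to segment a non-existent object can raise the aggregate score.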