Promptable segmentation typically requires instance-specific manual prompts to guide the segmentation of each desired object. To minimize this need, task-generic promptable segmentation has been introduced, which employs a single task-generic prompt to segment various images of different objects within the same task. Current methods use Multimodal Large Language Models (MLLMs) to reason detailed instance-specific prompts from a task-generic prompt, improving segmentation accuracy. The effectiveness of this segmentation heavily depends on the precision of the derived prompts. However, MLLMs often suffer from hallucinations during reasoning, resulting in inaccurate prompting. While existing methods focus on eliminating hallucinations to improve a model, we argue that MLLM hallucinations can reveal valuable contextual insights when leveraged correctly, as they represent pre-trained large-scale knowledge beyond individual images. In this paper, we exploit hallucinations to mine task-related information from images and verify its accuracy, enhancing the precision of the generated prompts. Specifically, we introduce an iterative Prompt-Mask Cycle generation framework (ProMaC) with a prompt generator and a mask generator. The prompt generator uses multi-scale chain-of-thought prompting, initially exploring hallucinations to extract extended contextual knowledge from a test image. These hallucinations are then reduced to formulate precise instance-specific prompts, which direct the mask generator to produce masks consistent with task semantics via mask semantic alignment. The generated masks iteratively induce the prompt generator to focus on task-relevant image regions and to discard irrelevant hallucinations, jointly yielding better prompts and masks. Experiments on five benchmarks demonstrate the effectiveness of ProMaC. Code is available at https://lwpyh.github.io/ProMaC/.
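The iterative prompt-mask cycle described in the abstract can be sketched as a simple loop in which each generated mask conditions the next round of prompt generation. The sketch below is a minimal, hedged illustration of that control flow only; the function names (`generate_prompt`, `generate_mask`, `align_mask`) and their stub bodies are hypothetical placeholders, not the authors' actual API, and in practice the prompt generator would be an MLLM and the mask generator a promptable segmenter.

```python
# Hypothetical stubs standing in for the framework's components.
def generate_prompt(image, task_prompt, mask):
    # Stub: in practice, an MLLM with multi-scale chain-of-thought
    # prompting, conditioned on the previous mask when one exists.
    region = "masked region" if mask is not None else "full image"
    return f"{task_prompt} ({region})"

def generate_mask(image, instance_prompt):
    # Stub: in practice, a promptable segmenter driven by the
    # instance-specific prompt.
    return {"prompt": instance_prompt}

def align_mask(mask, task_prompt):
    # Stub: mask semantic alignment keeps the mask consistent
    # with the task semantics.
    mask["aligned_to"] = task_prompt
    return mask

def promac_cycle(image, task_prompt, num_iters=3):
    """Iteratively refine instance-specific prompts and masks.

    Each iteration: the prompt generator distills (possibly
    hallucinated) context into an instance-specific prompt; the mask
    generator segments and aligns a mask, which then focuses the next
    prompt-generation step on task-relevant regions.
    """
    mask = None
    instance_prompt = task_prompt
    for _ in range(num_iters):
        instance_prompt = generate_prompt(image, task_prompt, mask)
        mask = align_mask(generate_mask(image, instance_prompt), task_prompt)
    return instance_prompt, mask
```

After the first pass the mask is no longer `None`, so subsequent prompts are conditioned on the previous mask, mirroring the feedback loop in which masks and prompts improve jointly.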