Prompt-Guided Mask Proposal for Two-Stage Open-Vocabulary Segmentation

We address open-vocabulary segmentation, in which objects from a wide range of categories must be segmented in diverse environments given text prompts as input. Existing methods typically rely on multi-modal models such as CLIP, which embed image and text features in a shared space to bridge the gap between limited- and large-vocabulary recognition. This leads to a two-stage approach: in the first stage, a mask generator takes the input image and produces mask proposals; in the second stage, the target mask is selected based on the query. However, the expected target mask may not exist among the generated proposals, in which case the second stage cannot return the expected mask. We propose Prompt-guided Mask Proposal (PMP), in which the mask generator also takes the text prompts as input and generates masks guided by them. Compared with proposals generated without the prompts, masks generated by PMP are better aligned with the prompts. To realize PMP, we design a cross-attention mechanism between text tokens and query tokens that produces prompt-guided mask proposals after each decoding step. We combine PMP with several existing methods that use a query-based segmentation backbone. Experiments on five benchmark datasets demonstrate the effectiveness of this approach, with absolute gains of 1% to 3% mIoU over current two-stage models. The consistent improvement across these benchmarks indicates that the proposed lightweight, prompt-aware method generalizes well.
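
As a rough illustration of the cross-attention between text tokens and query tokens described above, the sketch below lets each mask query attend over the prompt's text tokens and adds the resulting context back to the query. It is not the authors' implementation: the function names are hypothetical, and learned projection matrices, multiple heads, and normalization layers are all omitted to keep the example dependency-free.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def prompt_cross_attention(queries, text_tokens):
    """Update each mask query with a text-weighted residual (illustrative only).

    queries:     list of N vectors of dimension d (mask queries of the decoder).
    text_tokens: list of T vectors of dimension d (encoded text prompt).
    Returns (updated_queries, attention_weights).

    Assumption: queries act as attention queries and text tokens act as both
    keys and values, with no learned projections.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)  # standard scaled dot-product attention factor
    updated, weights = [], []
    for q in queries:
        # One attention distribution over the text tokens per mask query.
        attn = softmax([dot(q, t) * scale for t in text_tokens])
        # Context vector: attention-weighted sum of the text tokens.
        context = [
            sum(a * t[i] for a, t in zip(attn, text_tokens)) for i in range(d)
        ]
        # Residual update keeps the original query information.
        updated.append([qi + ci for qi, ci in zip(q, context)])
        weights.append(attn)
    return updated, weights


# Toy usage: two mask queries and two text tokens in a 2-D embedding space.
queries = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[2.0, 0.0], [0.0, 2.0]]
updated_queries, attention = prompt_cross_attention(queries, text_tokens)
```

In this toy setup each query puts most of its attention on the text token it is most similar to, so the updated queries are pulled toward the prompt before masks are predicted from them. In a query-based decoder, an update of this kind would plausibly be applied once per decoder layer.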
