We propose a new framework that automatically generates high-quality segmentation masks with their referring expressions as pseudo supervisions for referring image segmentation (RIS). These pseudo supervisions allow the training of any supervised RIS method without the cost of manual labeling. To achieve this, we incorporate existing segmentation and image captioning foundation models, leveraging their broad generalization capabilities. However, naively combining these models may generate expressions that do not distinctively refer to the target masks. To address this challenge, we propose a two-fold strategy for generating distinctive captions: 1) 'distinctive caption sampling', a new decoding method for the captioning model, which generates multiple expression candidates with detailed words focusing on the target; and 2) 'distinctiveness-based text filtering', which further validates the candidates and filters out those with a low level of distinctiveness. These two strategies ensure that the generated text supervisions can distinguish the target from other objects, making them appropriate as RIS annotations. Our method significantly outperforms both weakly supervised and zero-shot state-of-the-art (SoTA) methods on the RIS benchmark datasets. It also surpasses fully supervised methods in unseen domains, proving its capability to tackle the open-world challenge within RIS. Furthermore, integrating our method with human annotations yields further improvements, highlighting its potential in semi-supervised learning applications.
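To make the two-stage pseudo-labeling procedure described above concrete, the following is a minimal sketch in Python. All names here are hypothetical placeholders, not the paper's actual interfaces: `segment` stands in for a mask-proposal foundation model, `sample_captions` for the captioning model with the proposed distinctive caption sampling, and `distinctiveness` for a scorer (e.g., a CLIP-style matcher) used in distinctiveness-based text filtering; the threshold `tau` and number of candidates `k` are illustrative.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Any


@dataclass
class PseudoLabel:
    mask: Any          # binary segmentation mask for one candidate object
    expression: str    # referring expression kept after filtering


def generate_pseudo_labels(
    image: Any,
    segment: Callable[[Any], List[Any]],                      # hypothetical mask-proposal model
    sample_captions: Callable[[Any, Any, int], List[str]],    # hypothetical captioner sampling k candidates per mask
    distinctiveness: Callable[[str, Any, Sequence[Any]], float],  # hypothetical scorer: how uniquely a caption picks out its mask
    k: int = 8,
    tau: float = 0.5,
) -> List[PseudoLabel]:
    """Sketch of the pipeline: sample caption candidates per mask,
    then keep only captions whose distinctiveness clears a threshold."""
    masks = segment(image)
    labels: List[PseudoLabel] = []
    for mask in masks:
        # Stage 1: distinctive caption sampling -- draw several candidate
        # expressions focused on the target region.
        candidates = sample_captions(image, mask, k)
        if not candidates:
            continue
        # Stage 2: distinctiveness-based text filtering -- score each candidate
        # against all masks and keep the best one only if it is distinctive enough.
        scored = [(distinctiveness(c, mask, masks), c) for c in candidates]
        best_score, best_caption = max(scored)
        if best_score >= tau:
            labels.append(PseudoLabel(mask=mask, expression=best_caption))
    return labels
```

The resulting (mask, expression) pairs can then be used as pseudo annotations to train any off-the-shelf supervised RIS model.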