We propose a new framework that automatically generates high-quality segmentation masks paired with referring expressions, which serve as pseudo-supervision for referring image segmentation (RIS). This pseudo-supervision allows any supervised RIS method to be trained without the cost of manual labeling. To achieve this, we combine existing segmentation and image captioning foundation models, leveraging their broad generalization capabilities. However, naively combining these models may produce expressions that do not uniquely refer to the target masks. To address this challenge, we propose a two-fold strategy for generating distinctive captions: 1) 'distinctive caption sampling', a new decoding method for the captioning model that generates multiple expression candidates with detailed words focusing on the target, and 2) 'distinctiveness-based text filtering', which further validates the candidates and filters out those with low distinctiveness. Together, these two strategies ensure that the generated text supervision can distinguish the target from other objects, making it appropriate for RIS annotation. Our method significantly outperforms both weakly supervised and zero-shot SoTA methods on the RIS benchmark datasets. It also surpasses fully supervised methods in unseen domains, demonstrating its capability to tackle the open-world challenge within RIS. Furthermore, combining our method with human annotations yields further improvements, highlighting its potential in semi-supervised learning applications.
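To make the idea of distinctiveness-based text filtering concrete, the following is a minimal sketch under stated assumptions, not the method's actual implementation. The abstract does not specify how distinctiveness is measured, so this sketch assumes each candidate caption already has a matching score against every mask in the image (for example, from some image-text similarity model) and keeps a caption only if its score on the target mask exceeds its best score on any other mask by a margin. The function name `filter_distinctive`, the `margin` parameter, and the example scores are all hypothetical.

```python
from typing import Dict, List


def filter_distinctive(
    scores: Dict[str, List[float]],
    target_index: int,
    margin: float = 0.1,
) -> List[str]:
    """Keep captions whose score on the target mask beats every other mask by `margin`.

    Assumption: `scores[caption][i]` is a caption-to-mask matching score for
    mask i in the same image; higher means a better match. How such scores are
    obtained is outside this sketch.
    """
    kept = []
    for caption, per_mask in scores.items():
        target_score = per_mask[target_index]
        others = [s for i, s in enumerate(per_mask) if i != target_index]
        # With no competing masks, any caption is trivially distinctive.
        best_other = max(others) if others else float("-inf")
        if target_score - best_other >= margin:
            kept.append(caption)
    return kept


# Hypothetical scores for three masks in one image; mask 0 is the target.
candidates = {
    "a dog": [0.80, 0.78, 0.10],  # matches a second dog almost equally well
    "a brown dog lying on the sofa": [0.85, 0.30, 0.05],
    "a sofa": [0.20, 0.15, 0.90],  # describes a different object
}
distinctive = filter_distinctive(candidates, target_index=0)
print(distinctive)
```

In this toy example only the detailed caption survives: the generic caption matches another mask nearly as well as the target, and the third caption matches a different object. A margin-based rule is one simple design choice; a ranking or threshold on a normalized score would serve the same illustrative purpose.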