Referring image segmentation, the task of segmenting arbitrary entities described in free-form text, opens up a variety of vision applications. However, manually labeling training data for this task is prohibitively costly, leading to a lack of labeled data. We address this issue with a weakly supervised learning approach that uses text descriptions of training images as the only source of supervision. To this end, we first present a new model that discovers semantic entities in the input image and then combines the entities relevant to the text query to predict the mask of the referent. We also present a new loss function that allows the model to be trained without any further supervision. Our method was evaluated on four public benchmarks for referring image segmentation, where it clearly outperformed both the existing method for the same task and recent open-vocabulary segmentation models on all four benchmarks.