In this work, we investigate semantic segmentation trained solely on image-sentence pairs. Lacking dense annotations, existing text-supervised methods can only learn to group an image into semantic regions through pixel-insensitive feedback. As a result, their grouped results are coarse and often contain small spurious regions, which limits the upper bound of segmentation performance. In contrast, we observe that grouped results from self-supervised models are more semantically consistent and break the bottleneck of existing methods. Motivated by this, we associate self-supervised spatially-consistent grouping with text-supervised semantic segmentation. Because the grouped results are part-like, we further adapt a text-supervised model from image-level to region-level recognition with two core designs. First, we encourage fine-grained alignment with a one-way noun-to-region contrastive loss, which reduces mismatched noun-region pairs. Second, we adopt a contextually aware masking strategy to enable simultaneous recognition of all grouped regions. Coupling spatially-consistent grouping with region-adapted recognition, our method achieves 59.2% mIoU and 32.4% mIoU on the Pascal VOC and Pascal Context benchmarks, respectively, significantly surpassing state-of-the-art methods.
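To make the idea of a one-way noun-to-region contrastive loss more concrete, the following is a minimal sketch under stated assumptions, not the formulation used in this work. It assumes that each noun is scored against an image by its best-matching region (maximum cosine similarity), and that the loss is an InfoNCE-style term computed only in the noun-to-image direction across a batch. The function names, the max-pooling over regions, and the temperature value are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def noun_to_image_score(noun, regions):
    """Score a noun against an image by its best-matching region.

    Assumption: max-pooling over regions; the source does not specify this.
    """
    return max(cosine(noun, region) for region in regions)


def one_way_noun_region_loss(nouns_per_image, regions_per_image, temperature=0.07):
    """InfoNCE-style loss in the noun-to-region direction only.

    nouns_per_image[i]   : list of noun embeddings from the caption of image i
    regions_per_image[i] : list of region embeddings grouped from image i
    Each noun should score higher on its own image than on other images
    in the batch. No region-to-noun term is added, hence "one-way".
    """
    losses = []
    for i, nouns in enumerate(nouns_per_image):
        for noun in nouns:
            logits = [
                noun_to_image_score(noun, regions) / temperature
                for regions in regions_per_image
            ]
            # Numerically stable log-sum-exp over all images in the batch.
            peak = max(logits)
            log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
            losses.append(log_denominator - logits[i])
    # An empty batch (no nouns) contributes no loss.
    return sum(losses) / len(losses) if losses else 0.0


# Toy batch: two images with one region each, and one noun per caption.
regions = [[[1.0, 0.0]], [[0.0, 1.0]]]
aligned_nouns = [[[1.0, 0.0]], [[0.0, 1.0]]]
swapped_nouns = [[[0.0, 1.0]], [[1.0, 0.0]]]

aligned_loss = one_way_noun_region_loss(aligned_nouns, regions)
swapped_loss = one_way_noun_region_loss(swapped_nouns, regions)
```

In this toy batch the loss is close to zero when each noun matches a region of its own image, and large when the nouns are swapped between images. Dropping the region-to-noun direction reflects the intuition that part-like regions need not each correspond to a noun in the caption, although how this work handles that case is not stated in the abstract.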