Existing Referring Image Segmentation (RIS) methods typically require expensive pixel-level or box-level annotations for supervision. In this paper, we observe that the referring texts used in RIS already provide sufficient information to localize the target object. Hence, we propose a novel weakly-supervised RIS framework that formulates target localization as a classification process that differentiates between positive and negative text expressions. The referring text expressions for an image serve as its positive expressions, while the referring text expressions from other images serve as negative expressions for that image. Our framework has three main novelties. First, we propose a bilateral prompt method that facilitates the classification process by harmonizing the domain discrepancy between visual and linguistic features. Second, we propose a calibration method that reduces noisy background information and improves the correctness of the response maps for target object localization. Third, we propose a positive response map selection strategy that generates high-quality pseudo-labels from the enhanced response maps, which are then used to train a segmentation network for RIS inference. For evaluation, we propose a new metric to measure localization accuracy. Experiments on four benchmarks show that our framework achieves promising performance compared to existing fully-supervised RIS methods while outperforming state-of-the-art weakly-supervised methods adapted from related areas. Code is available at https://github.com/fawnliu/TRIS.
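The classification formulation described above can be illustrated with a minimal sketch. It is not the released TRIS implementation: the function names, the use of cosine similarity for response maps, max pooling, the `scale` factor, and the binary cross-entropy loss are all assumptions made for illustration, and the bilateral prompt, calibration, and response map selection steps are not shown.

```python
import numpy as np


def build_batch(own_texts, other_texts, num_neg, rng):
    """Pair an image with positive and negative expression embeddings.

    own_texts:   (P, D) embeddings of expressions referring to this image.
    other_texts: (M, D) embeddings of expressions taken from other images.
    Returns the stacked embeddings and 1/0 labels (positive/negative).
    """
    neg_idx = rng.choice(len(other_texts), size=num_neg, replace=False)
    embs = np.concatenate([own_texts, other_texts[neg_idx]], axis=0)
    labels = np.concatenate([np.ones(len(own_texts)), np.zeros(num_neg)])
    return embs, labels


def response_maps(pixel_feats, text_embs):
    """Cosine similarity between every expression and every pixel feature.

    pixel_feats: (H, W, D), text_embs: (N, D). Returns (N, H, W).
    """
    p = pixel_feats / (np.linalg.norm(pixel_feats, axis=-1, keepdims=True) + 1e-8)
    t = text_embs / (np.linalg.norm(text_embs, axis=-1, keepdims=True) + 1e-8)
    return np.einsum("hwd,nd->nhw", p, t)


def classification_loss(maps, labels, scale=10.0):
    """Binary cross-entropy on max-pooled response maps (assumed pooling)."""
    logits = scale * maps.reshape(maps.shape[0], -1).max(axis=1)
    probs = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-8
    bce = labels * np.log(probs + eps) + (1.0 - labels) * np.log(1.0 - probs + eps)
    return float(-bce.mean())
```

Under these assumptions, minimizing the loss pushes the response map of a positive expression to peak somewhere in the image while suppressing the maps of negative expressions, so the positive maps can serve as a localization cue without any mask or box annotation.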