Referring image segmentation aims to segment the object referred to by a natural language expression from an image. The task is challenging because of the distinct data properties of text and images, and because of the randomness introduced by diverse objects and unrestricted language expressions. Most previous work focuses on improving cross-modal feature fusion and does not fully address the inherent uncertainty caused by diverse objects and unrestricted language. To tackle these problems, we propose an end-to-end Multi-Mask Network for referring image segmentation (MMNet). We first fuse image and language features and then employ an attention mechanism to generate multiple queries that represent different aspects of the language expression. We then use these queries to produce a series of corresponding segmentation masks, assigning each mask a score that reflects its importance. The final result is obtained as the weighted sum of all masks, which greatly reduces the randomness of the language expression. Our proposed framework demonstrates superior performance compared to state-of-the-art approaches on the three most commonly used datasets, RefCOCO, RefCOCO+ and G-Ref, without the need for any post-processing. This further validates the efficacy of our proposed framework.
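The final fusion step described above, a score-weighted sum over several candidate masks, can be illustrated with a minimal sketch. It is not the MMNet implementation: the function names `softmax`, `sigmoid`, and `fuse_masks` are hypothetical, and normalizing the scores with a softmax, summing mask logits rather than probabilities, and thresholding at 0.5 are all assumptions, since the abstract states only that the masks are combined by a weighted sum.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that sum to 1 (assumed normalization)."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse_masks(mask_logits, scores, threshold=0.5):
    """Combine N candidate masks (each H x W of logits) into one binary mask.

    Each mask is weighted by its normalized score, the weighted logits are
    summed per pixel, and the result is thresholded (assumed binarization).
    """
    weights = softmax(scores)
    height, width = len(mask_logits[0]), len(mask_logits[0][0])
    fused = [[0.0] * width for _ in range(height)]
    for mask, weight in zip(mask_logits, weights):
        for i in range(height):
            for j in range(width):
                fused[i][j] += weight * mask[i][j]
    return [[1 if sigmoid(v) > threshold else 0 for v in row] for row in fused]


# Two hypothetical 2 x 2 candidate masks that disagree on the top row.
mask_a = [[4.0, -4.0], [4.0, -4.0]]
mask_b = [[-4.0, 4.0], [4.0, -4.0]]

# mask_a has the higher score, so it dominates wherever the two disagree.
result = fuse_masks([mask_a, mask_b], scores=[2.0, 0.0])
print(result)
```

In this toy case the higher-scored mask decides the pixels on which the candidates disagree, while pixels on which they agree are unaffected by the scores.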