Referring Expression Segmentation (RES) aims to generate a segmentation mask for the object described by a given language expression. Existing classic RES datasets and methods commonly support only single-target expressions, i.e., one expression refers to one target object; multi-target and no-target expressions are not considered, which limits the use of RES in practice. In this paper, we introduce a new benchmark called Generalized Referring Expression Segmentation (GRES), which extends classic RES to allow expressions to refer to an arbitrary number of target objects. To this end, we construct the first large-scale GRES dataset, gRefCOCO, which contains multi-target, no-target, and single-target expressions. GRES and gRefCOCO are designed to be compatible with RES, facilitating extensive experiments that study the performance gap of existing RES methods on the GRES task. In our experimental study, we find that one of the major challenges of GRES is complex relationship modeling. Based on this, we propose a region-based GRES baseline, ReLA, that adaptively divides the image into regions with sub-instance clues and explicitly models the region-region and region-language dependencies. ReLA achieves new state-of-the-art performance on both the newly proposed GRES task and the classic RES task. The proposed gRefCOCO dataset and method are available at https://henghuiding.github.io/GRES.
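As a rough illustration of the three expression types, the sketch below assumes that the target of an expression can be represented as the union of binary instance masks, with an all-background mask for a no-target expression. The mask representation and all function names are hypothetical and are not taken from gRefCOCO or ReLA.

```python
from typing import List

# Hypothetical representation: a binary mask as rows of 0/1 values.
Mask = List[List[int]]


def empty_mask(height: int, width: int) -> Mask:
    """All-background mask, used here for a no-target expression."""
    return [[0] * width for _ in range(height)]


def target_mask(instance_masks: List[Mask], height: int, width: int) -> Mask:
    """Union of the masks of all referred instances (an assumption of this sketch).

    Zero instances gives an empty mask (no-target), one instance
    reproduces the classic RES case, several give a multi-target mask.
    """
    merged = empty_mask(height, width)
    for mask in instance_masks:
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    merged[y][x] = 1
    return merged


def is_no_target(mask: Mask) -> bool:
    """True when the mask contains no foreground pixel."""
    return not any(any(row) for row in mask)


# Two toy instance masks on a 2x3 image.
person_a = [[1, 0, 0], [1, 0, 0]]
person_b = [[0, 0, 1], [0, 0, 1]]

both = target_mask([person_a, person_b], 2, 3)  # multi-target expression
single = target_mask([person_a], 2, 3)          # single-target expression
nothing = target_mask([], 2, 3)                 # no-target expression
```

Under this assumption, a classic RES sample is just the special case with exactly one referred instance, which is one way to read the compatibility between GRES and RES described above.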