Open-vocabulary image segmentation aims to partition an image into semantic regions according to arbitrary text descriptions. However, complex visual scenes can be naturally decomposed into simpler parts and abstracted at multiple levels of granularity, introducing inherent segmentation ambiguity. Unlike existing methods that typically sidestep this ambiguity and treat it as an external factor, our approach actively incorporates a hierarchical representation encompassing different semantic levels into the learning process. We propose a decoupled text-image fusion mechanism and representation learning modules for both "things" and "stuff". Additionally, we systematically examine the differences in textual and visual features between these two types of categories. Our resulting model, named HIPIE, tackles HIerarchical, oPen-vocabulary, and unIvErsal segmentation tasks within a unified framework. Benchmarked on over 40 datasets, e.g., ADE20K, COCO, Pascal-VOC Part, RefCOCO/RefCOCOg, ODinW and SeginW, HIPIE achieves state-of-the-art results at various levels of image comprehension, including semantic-level (e.g., semantic segmentation), instance-level (e.g., panoptic/referring segmentation and object detection), and part-level (e.g., part/subpart segmentation) tasks. Our code is released at https://github.com/berkeley-hipie/HIPIE.