We study universal zero-shot segmentation, which aims to achieve panoptic, instance, and semantic segmentation for novel categories without any training samples. This zero-shot ability relies on inter-class relationships in semantic space to transfer visual knowledge learned from seen categories to unseen ones. It is therefore desirable to bridge the semantic and visual spaces effectively and to apply semantic relationships to visual feature learning. We introduce a generative model that synthesizes features for unseen categories, which links the semantic and visual spaces and addresses the lack of training data for unseen categories. Furthermore, to mitigate the domain gap between the semantic and visual spaces, we first enhance the vanilla generator with learned primitives, each of which contains fine-grained attributes related to categories, and synthesize unseen features by selectively assembling these primitives. Second, we propose to disentangle the visual feature into a semantic-related part and a semantic-unrelated part; the latter contains useful visual classification clues but is less relevant to the semantic representation. The inter-class relationships of the semantic-related visual features are then required to align with those in semantic space, thereby transferring semantic knowledge to visual feature learning. The proposed approach achieves state-of-the-art performance on zero-shot panoptic segmentation, instance segmentation, and semantic segmentation. Code is available at https://henghuiding.github.io/PADing/.
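The relationship alignment described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of cosine similarity to measure inter-class relationships, and the mean-squared-error penalty are all assumptions chosen for illustration. The sketch takes one semantic embedding and one semantic-related visual feature per category, builds a pairwise similarity matrix in each space, and penalizes the difference between the two matrices.

```python
import math

# Hypothetical helper: pairwise cosine similarity between per-class vectors.
# Cosine similarity is an assumed choice of relationship measure.
def cosine_similarity_matrix(vectors):
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    n = len(vectors)
    sims = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            # Guard against division by zero for all-zero vectors.
            sims[i][j] = dot / max(norms[i] * norms[j], 1e-12)
    return sims

# Hypothetical loss: mean squared difference between the inter-class
# relationship matrices of the semantic and visual spaces (an assumption).
def relationship_alignment_loss(semantic_embeddings, visual_features):
    sem = cosine_similarity_matrix(semantic_embeddings)
    vis = cosine_similarity_matrix(visual_features)
    n = len(sem)
    total = sum((sem[i][j] - vis[i][j]) ** 2
                for i in range(n) for j in range(n))
    return total / (n * n)

# Toy data: three categories, 2-D semantic space, 3-D visual space.
semantic = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
aligned_visual = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [1.0, 1.0, 0.0]]
misaligned_visual = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

aligned_loss = relationship_alignment_loss(semantic, aligned_visual)
misaligned_loss = relationship_alignment_loss(semantic, misaligned_visual)
```

Because only the relationship matrices are compared, the semantic and visual vectors may have different dimensions. In the toy data, `aligned_visual` preserves the same inter-class structure as `semantic` and yields a loss near zero, whereas `misaligned_visual` yields a clearly larger loss.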