In semantic segmentation, adapting a visual system to novel object categories at inference time has always been both valuable and challenging. To enable such generalization, existing methods rely on either several support examples as visual cues or class names as textual cues. Although both lines of work have made encouraging progress, they have been studied in isolation, neglecting the complementary nature of low-level visual and high-level language information. In this paper, we define a unified setting termed open-set semantic segmentation (O3S), which aims to learn seen and unseen semantics from both visual examples and textual names. Our pipeline extracts multi-modal prototypes for the segmentation task in two stages: single-modal self-enhancement and aggregation, followed by multi-modal complementary fusion. Specifically, we aggregate visual features into several tokens that serve as visual prototypes, and enrich each class name with detailed descriptions to generate textual prototypes. The two modalities are then fused into multi-modal prototypes for the final segmentation. We conduct extensive experiments on both the \pascal and \coco datasets to evaluate the effectiveness of the framework. Training only on coarse-grained datasets, we achieve state-of-the-art results even on the more detailed part-segmentation benchmark Pascal-Animals. Thorough ablation studies dissect each component, both quantitatively and qualitatively.
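The prototype-based pipeline described above can be illustrated with a minimal sketch. This is not the method of the paper: the abstract does not specify how features are aggregated or fused, so the sketch substitutes masked average pooling for the visual token aggregation, treats the textual prototype as a precomputed embedding, and fuses the two by a weighted sum of normalized vectors. The function names (`masked_average_pool`, `fuse_prototypes`, `segment`) and the weight `alpha` are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def masked_average_pool(features, mask):
    """Assumed stand-in for visual aggregation.

    features: (H, W, C) support feature map; mask: (H, W) binary mask.
    Returns a single (C,) visual prototype.
    """
    weights = mask.astype(features.dtype)
    total = weights.sum()
    if total == 0:
        raise ValueError("support mask is empty")
    return (features * weights[..., None]).sum(axis=(0, 1)) / total


def fuse_prototypes(visual_proto, text_proto, alpha=0.5):
    """Assumed fusion rule: weighted sum of unit-norm prototypes."""
    fused = alpha * l2_normalize(visual_proto) + (1 - alpha) * l2_normalize(text_proto)
    return l2_normalize(fused)


def segment(query_features, prototypes):
    """Label each query pixel with its most cosine-similar prototype.

    query_features: (H, W, C); prototypes: (K, C). Returns (H, W) labels.
    """
    scores = l2_normalize(query_features) @ l2_normalize(prototypes).T  # (H, W, K)
    return scores.argmax(axis=-1)
```

In this sketch, one fused prototype is built per class and stacked into a `(K, C)` array before calling `segment`; a learned fusion module or multiple tokens per class would replace `fuse_prototypes` and `masked_average_pool` respectively.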