Thanks to the impressive progress of large-scale vision-language pretraining, recent recognition models can classify arbitrary objects in a zero-shot and open-set manner with surprisingly high accuracy. However, translating this success to semantic segmentation is not trivial: this dense prediction task requires not only accurate semantic understanding but also fine shape delineation, whereas existing vision-language models are trained with image-level language descriptions. To bridge this gap, we pursue \textbf{shape-aware} zero-shot semantic segmentation in this study. Inspired by classical spectral methods in the image segmentation literature, we propose to leverage the eigenvectors of Laplacian matrices constructed with self-supervised pixel-wise features to promote shape-awareness. Although this simple and effective technique does not use the masks of seen classes at all, we demonstrate that it outperforms a state-of-the-art shape-aware formulation that aligns ground-truth and predicted edges during training. We also examine the performance gains achieved on different datasets with different backbones and draw several interesting and conclusive observations: the benefit of promoting shape-awareness is closely related to mask compactness and language embedding locality. Finally, our method sets a new state of the art for zero-shot semantic segmentation on both Pascal and COCO, by significant margins. Code and models will be available at https://github.com/Liuxinyv/SAZS.
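As a rough illustration of the spectral idea mentioned above, the following minimal sketch computes the low-frequency eigenvectors of a graph Laplacian built from per-pixel features. It is not the authors' implementation: the function name `spectral_eigenvectors`, the clipped cosine-similarity affinity, the unnormalised Laplacian $L = D - W$, and the toy features are all assumptions made for illustration. In practice the features would come from a self-supervised backbone.

```python
import numpy as np


def spectral_eigenvectors(features, num_vectors=1, threshold=0.0):
    """Return the smallest non-trivial Laplacian eigenpairs.

    features: (N, C) array, one feature vector per pixel (hypothetical input).
    Returns (eigenvalues, eigenvectors) with shapes (k,) and (N, k).
    """
    # L2-normalise rows so that dot products are cosine similarities.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)

    # Assumed affinity: cosine similarity, values at or below the threshold dropped.
    affinity = unit @ unit.T
    affinity = np.where(affinity > threshold, affinity, 0.0)
    np.fill_diagonal(affinity, 0.0)

    # Unnormalised graph Laplacian L = D - W.
    degree = affinity.sum(axis=1)
    laplacian = np.diag(degree) - affinity

    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(laplacian)

    # Skip the first eigenvector, which is constant for a connected graph.
    return eigenvalues[1:num_vectors + 1], eigenvectors[:, 1:num_vectors + 1]


# Toy example: eight "pixels" forming two groups in feature space.
toy_features = np.array([[1.0, 0.1]] * 4 + [[0.1, 1.0]] * 4)
values, vectors = spectral_eigenvectors(toy_features, num_vectors=1)
# The sign of the first non-trivial eigenvector splits the pixels into two groups.
print(vectors[:, 0] > 0)
```

On a real image the dense $N \times N$ affinity becomes expensive, so a sparse affinity and a sparse eigensolver would likely be needed. The dense version is kept here only for readability.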