Pre-trained vision-language models, exemplified by CLIP, have advanced zero-shot semantic segmentation by aligning visual features with class embeddings through a transformer decoder to generate semantic masks. Despite their effectiveness, prevailing methods in this paradigm still face challenges, including overfitting to seen classes and small fragmented regions in masks. To mitigate these issues, we propose a Language-Driven Visual Consensus (LDVC) approach that fosters better alignment of semantic and visual information. Specifically, we use class embeddings as anchors, owing to their discrete and abstract nature, and steer vision features toward them. Moreover, to avoid noisy alignments from the vision side, which is redundant by nature, we introduce route attention into self-attention to find visual consensus, thereby enhancing semantic consistency within the same object. Equipped with a vision-language prompting strategy, our approach significantly improves the generalization capacity of segmentation models to unseen classes. Experimental results demonstrate its effectiveness, with mIoU gains of 4.5 on PASCAL VOC 2012 and 3.6 on COCO-Stuff 164k for unseen classes over state-of-the-art methods.
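To make the general paradigm concrete, the following is a minimal sketch of labeling pixels by their similarity to class embeddings and scoring the result with mIoU. It is not the LDVC method: the transformer decoder, route attention, and prompting strategy are all omitted, and the function names (`assign_classes`, `mean_iou`), the two-dimensional toy features, and the assumption of non-zero feature vectors are illustrative choices rather than anything specified by the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors (assumed non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def assign_classes(pixel_features, class_embeddings):
    """Label each pixel with the index of its most similar class embedding."""
    labels = []
    for feat in pixel_features:
        scores = [cosine(feat, emb) for emb in class_embeddings]
        labels.append(max(range(len(scores)), key=scores.__getitem__))
    return labels


def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy data (hypothetical): two class embeddings and four pixel features.
class_embeddings = [[1.0, 0.0], [0.0, 1.0]]
pixel_features = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.4, 0.6]]
target = [0, 1, 0, 0]

pred = assign_classes(pixel_features, class_embeddings)
score = mean_iou(pred, target, num_classes=2)
```

In this toy case the last pixel is closer to the second class embedding than to its ground-truth class, so it is mislabeled and the mIoU falls below 1. Per-pixel decisions of this kind, made independently, are one way that small fragmented regions can appear in a mask.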