Zero-shot classification capabilities naturally arise in models trained within a vision-language contrastive framework. Despite their classification prowess, these models struggle in dense tasks such as zero-shot open-vocabulary segmentation. This deficiency is often attributed to the absence of localization cues in captions and to the intertwined nature of the learning process, which encompasses both image representation learning and cross-modality alignment. To tackle these issues, we propose SimZSS, a Simple framework for open-vocabulary Zero-Shot Segmentation. The method is founded on two key principles: i) leveraging frozen vision-only models that exhibit spatial awareness while aligning only the text encoder, and ii) exploiting the discrete nature of text and linguistic knowledge to pinpoint local concepts within captions. By capitalizing on the quality of the visual representations, our method requires only datasets of image-caption pairs and adapts to both small curated and large-scale noisy datasets. When trained on COCO Captions across 8 GPUs, SimZSS achieves state-of-the-art results on 7 out of 8 benchmark datasets in less than 15 minutes.
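To make the two principles concrete, the following is a minimal sketch, not the authors' released implementation. It assumes noun chunks from spaCy serve as the "local concepts" extracted via linguistic knowledge, that the frozen vision backbone (e.g. a DINOv2-style ViT) yields one pooled feature per matched region, and that alignment uses a standard InfoNCE-style contrastive loss; the function names and the `temperature` parameter are hypothetical illustration choices.

```python
# Hypothetical sketch of the two principles; not the authors' code.
import torch
import torch.nn.functional as F
import spacy

# Linguistic knowledge used to pinpoint local concepts in captions.
nlp = spacy.load("en_core_web_sm")

def extract_concepts(caption: str) -> list[str]:
    """Use noun chunks as a proxy for local concepts in the caption."""
    return [chunk.text for chunk in nlp(caption).noun_chunks]

def alignment_loss(patch_feats: torch.Tensor,
                   concept_feats: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """Contrastive alignment of concept embeddings to frozen visual features.

    patch_feats:   (B, D) region features from the frozen, spatially aware
                   vision encoder (gradients blocked: only text is trained).
    concept_feats: (B, D) embeddings of extracted concepts from the
                   trainable text encoder.
    """
    v = F.normalize(patch_feats.detach(), dim=-1)  # frozen vision side
    t = F.normalize(concept_feats, dim=-1)
    logits = t @ v.T / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```

Because the vision features are detached, only the text encoder receives gradients, matching the stated principle of aligning the text side to a frozen, spatially aware visual backbone.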