Open-vocabulary semantic segmentation (OVSS) extends traditional closed-set segmentation by enabling pixel-wise labeling of both seen and unseen categories from arbitrary textual descriptions. While existing methods leverage vision-language models (VLMs) such as CLIP, their reliance on image-level pretraining often yields imprecise spatial alignment, producing mismatched segmentations in ambiguous or cluttered scenes. Moreover, most existing approaches lack strong object priors and region-level constraints, which can lead to object hallucination or missed detections and further degrade performance. To address these challenges, we propose LoGoSeg, an efficient single-stage framework that integrates three key innovations: (i) an object existence prior that dynamically weights relevant categories through global image-text similarity, effectively reducing hallucinations; (ii) a region-aware alignment module that establishes precise region-level visual-textual correspondences; and (iii) a dual-stream fusion mechanism that combines local structural information with global semantic context. Unlike prior works, LoGoSeg requires no external mask proposals, additional backbones, or extra training data, keeping the pipeline efficient. Extensive experiments on six benchmarks (A-847, PC-459, A-150, PC-59, PAS-20, and PAS-20b) demonstrate competitive performance and strong generalization in open-vocabulary settings.
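To make mechanism (i) concrete, below is a minimal sketch of an object existence prior in the spirit described above, assuming CLIP-style global image and text embeddings. The function name `existence_weighted_logits`, the softmax temperature, and the multiplicative fusion of the prior with the dense logits are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def existence_weighted_logits(image_emb, text_embs, seg_logits, tau=0.07):
    """Re-weight per-category segmentation logits by a global
    image-text similarity prior (hypothetical sketch only).

    image_emb : (D,)        global image embedding (e.g. CLIP [CLS] token)
    text_embs : (C, D)      one embedding per category prompt
    seg_logits: (C, H, W)   dense per-category segmentation logits
    """
    # L2-normalize so dot products become cosine similarities
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)

    # Global image-text similarity per category
    sims = txt @ img                      # (C,)

    # Soft existence prior: categories dissimilar to the whole image
    # are down-weighted, suppressing hallucinated classes
    prior = np.exp(sims / tau)
    prior = prior / prior.sum()           # (C,)

    # Scale logits by the prior (one simple fusion choice;
    # the actual method may combine them differently)
    return seg_logits * prior[:, None, None]

# Toy usage with random data
rng = np.random.default_rng(0)
C, D, H, W = 5, 512, 8, 8
out = existence_weighted_logits(rng.normal(size=D),
                                rng.normal(size=(C, D)),
                                rng.normal(size=(C, H, W)))
print(out.shape)  # (5, 8, 8)
```

The design intuition is that a category whose text embedding is far from the global image embedding is unlikely to appear anywhere in the image, so its dense predictions can be safely suppressed before the final per-pixel decision.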