Open-vocabulary semantic segmentation enables models to segment objects or image regions beyond fixed class sets, offering flexibility in dynamic environments. However, existing methods often rely on single-view images and struggle with spatial precision, especially under occlusions and near object boundaries. We propose SENSE, the first work on Stereo OpEN Vocabulary SEmantic Segmentation, which leverages stereo vision and vision-language models to enhance open-vocabulary semantic segmentation. By incorporating stereo image pairs, we introduce geometric cues that improve spatial reasoning and segmentation accuracy. Trained on the PhraseStereo dataset, our approach achieves strong performance in phrase-grounded tasks and demonstrates generalization in zero-shot settings. On PhraseStereo, we show a +2.9% improvement in Average Precision over the baseline method and +0.76% over the best competing method. SENSE also provides a relative improvement of +3.5% mIoU on Cityscapes and +18% on KITTI compared to the baseline work. By jointly reasoning over semantics and geometry, SENSE supports accurate scene understanding from natural language, essential for autonomous robots and Intelligent Transportation Systems.