We propose a training-free method for open-vocabulary semantic segmentation using Vision-and-Language Models (VLMs). Our approach enhances the initial per-patch predictions of VLMs through label propagation, which jointly optimizes predictions by incorporating patch-to-patch relationships. Since VLMs are primarily optimized for cross-modal alignment and not for intra-modal similarity, we use a Vision Model (VM) that we observe to better capture these relationships. We address resolution limitations inherent to patch-based encoders by applying label propagation at the pixel level as a refinement step, significantly improving segmentation accuracy near class boundaries. Our method, called LPOSS+, performs inference over the entire image, avoiding window-based processing and thereby capturing contextual interactions across the full image. LPOSS+ achieves state-of-the-art performance among training-free methods across a diverse set of datasets. Code: https://github.com/vladan-stojnic/LPOSS
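To make the label-propagation step concrete, the following is a minimal sketch, not the authors' implementation: per-patch class scores from a VLM are smoothed over a k-nearest-neighbor affinity graph built from vision-model patch features, using the standard iterative propagation scheme. All names (`label_propagation`, `patch_feats`, `vlm_scores`) and the hyperparameter values are illustrative assumptions.

```python
# Hedged sketch of label propagation over patch-to-patch affinities.
# Assumes: patch_feats from a vision model, vlm_scores from a VLM;
# hyperparameters (alpha, k, n_iter) are illustrative, not the paper's.
import numpy as np

def label_propagation(patch_feats, vlm_scores, alpha=0.9, k=8, n_iter=50):
    """Jointly refine per-patch class scores using patch similarities.

    patch_feats: (N, D) patch features from a vision model.
    vlm_scores:  (N, C) initial per-patch class scores from a VLM.
    Returns refined (N, C) scores.
    """
    # Cosine-similarity affinities between patches.
    f = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    sim = f @ f.T
    np.fill_diagonal(sim, 0.0)

    # Sparsify: keep only the k strongest (non-negative) neighbors per patch.
    W = np.zeros_like(sim)
    nn = np.argsort(-sim, axis=1)[:, :k]
    rows = np.arange(sim.shape[0])[:, None]
    W[rows, nn] = np.clip(sim[rows, nn], 0.0, None)
    W = 0.5 * (W + W.T)  # symmetrize the graph

    # Symmetric normalization: S = D^{-1/2} W D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(W.sum(axis=1), 1e-12))
    S = (W * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]

    # Iterate Z <- alpha * S Z + (1 - alpha) * Y: diffuse labels over
    # the graph while anchoring to the initial VLM predictions Y.
    Y = vlm_scores
    Z = Y.copy()
    for _ in range(n_iter):
        Z = alpha * (S @ Z) + (1.0 - alpha) * Y
    return Z
```

The same smoothing can be reapplied at the pixel level (with pixel-to-pixel affinities) as the boundary-refinement step the abstract mentions; only the graph construction changes, not the propagation itself.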