We propose a training-free method for open-vocabulary semantic segmentation using Vision-and-Language Models (VLMs). Our approach enhances the initial per-patch predictions of VLMs through label propagation, which jointly optimizes predictions by incorporating patch-to-patch relationships. Since VLMs are primarily optimized for cross-modal alignment rather than intra-modal similarity, we use a Vision Model (VM), which we observe captures these relationships better. We address the resolution limitations inherent to patch-based encoders by applying label propagation at the pixel level as a refinement step, significantly improving segmentation accuracy near class boundaries. Our method, called LPOSS+, performs inference over the entire image, avoiding window-based processing and thereby capturing contextual interactions across the full image. LPOSS+ achieves state-of-the-art performance among training-free methods across a diverse set of datasets. Code: https://github.com/vladan-stojnic/LPOSS
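To illustrate the core mechanism, the following is a minimal sketch of classic label propagation in the style of Zhou et al., not the paper's exact formulation: initial per-patch class scores (e.g. from a VLM) are iteratively diffused over a patch-to-patch affinity graph (e.g. built from VM features), so each patch's prediction is smoothed toward those of its similar neighbors. The function name, normalization, and hyperparameters (`alpha`, `iters`) are illustrative assumptions.

```python
import numpy as np

def label_propagation(affinity, init_preds, alpha=0.85, iters=50):
    """Generic label propagation sketch (not the paper's exact method).

    affinity:   (N, N) nonnegative patch-to-patch similarity matrix W.
    init_preds: (N, C) initial per-patch class scores Y.
    Iterates Z <- alpha * S @ Z + (1 - alpha) * Y, where
    S = D^{-1/2} W D^{-1/2} is the symmetrically normalized affinity.
    """
    # Symmetric normalization of the affinity matrix.
    d = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    S = affinity * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

    # Diffuse the initial predictions over the graph, keeping a
    # (1 - alpha) pull back toward the initial VLM scores.
    Z = init_preds.copy()
    for _ in range(iters):
        Z = alpha * (S @ Z) + (1 - alpha) * init_preds
    return Z
```

In this sketch, a patch whose initial scores weakly favor the wrong class but whose neighbors confidently favor the right one is corrected by the diffusion, which is the intuition behind refining noisy per-patch VLM predictions with a stronger intra-modal affinity graph.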