The advent of Vision Language Models (VLMs) transformed image understanding from closed-set classification to dynamic image-language interaction, enabling open-vocabulary segmentation. Despite this flexibility, VLMs often fall behind closed-set classifiers in accuracy because they rely on ambiguous image captions and lack domain-specific knowledge. We therefore introduce a new task, domain adaptation for open-vocabulary segmentation, which enhances VLMs with domain-specific priors while preserving their open-vocabulary nature. Existing adaptation methods, when applied to segmentation tasks, improve performance on training queries but can reduce VLM performance on zero-shot text inputs. To address this shortcoming, we propose an approach that combines parameter-efficient prompt tuning with a triplet-loss-based training strategy designed to enhance open-vocabulary generalization while adapting to the visual domain. Our approach outperforms other parameter-efficient adaptation strategies on open-vocabulary segment classification tasks across indoor and outdoor datasets. Notably, it is the only one that consistently surpasses the original VLM on zero-shot queries. Our adapted VLMs can be integrated into existing open-vocabulary segmentation pipelines in a plug-and-play manner, without any changes to the methods, improving OV-Seg by +6.0% mIoU on ADE20K and OpenMask3D by +4.1% AP on ScanNet++ Offices.
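The abstract does not specify the exact form of the triplet loss, so the following is only a minimal sketch of a generic margin-based triplet loss over embeddings, not the authors' formulation. It assumes the anchor is a segment's image embedding, the positive is the text embedding of its correct class, and the negative is the text embedding of a different class; the cosine distance, the `margin` value, and all function names are illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Generic margin-based triplet loss using cosine distance.

    Assumed roles (hypothetical, not taken from the paper):
      anchor   -- image embedding of a segment
      positive -- text embedding of the matching class
      negative -- text embedding of a non-matching class
    The loss is zero once the positive is closer to the anchor
    than the negative by at least `margin`.
    """
    d_pos = 1.0 - cosine_similarity(anchor, positive)
    d_neg = 1.0 - cosine_similarity(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)
```

With these assumptions, a segment embedding that already aligns with its correct text embedding incurs no loss, while one that aligns with the wrong class is penalized in proportion to how far the ordering is violated.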