Large Vision-Language Models (VLMs) are increasingly regarded as foundation models that can be instructed to solve diverse tasks by prompting, without task-specific training. We examine the seemingly obvious question of how to effectively prompt VLMs for semantic segmentation. To that end, we systematically evaluate the segmentation performance of several recent models, guided by either text or visual prompts, on the out-of-distribution MESS dataset collection. We introduce a scalable prompting scheme, few-shot prompted semantic segmentation, inspired by open-vocabulary segmentation and few-shot learning. We find that VLMs lag far behind specialist models trained for a specific segmentation task, by about 30% on average on the Intersection-over-Union metric. Moreover, text prompts and visual prompts are complementary: each mode fails on many examples that the other can solve. Our analysis suggests that being able to anticipate the most effective prompt modality can lead to an 11% improvement in performance. Motivated by our findings, we propose PromptMatcher, a remarkably simple training-free baseline that combines both text and visual prompts, achieving state-of-the-art results on few-shot prompted semantic segmentation: it outperforms the best text-prompted VLM by 2.5% and the top visual-prompted VLM by 3.5%.
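To make the complementarity claim concrete, the sketch below shows the oracle computation behind the 11% figure: per example, an oracle that correctly anticipates the stronger prompt modality keeps whichever prediction scores higher IoU. This is a minimal illustration, not the paper's evaluation code; the function names are ours, and the binary-mask simplification is an assumption (the MESS benchmarks are multi-class).

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-Union between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    # Convention: two empty masks agree perfectly.
    return float(inter / union) if union > 0 else 1.0

def oracle_miou(text_preds, visual_preds, gts) -> float:
    """Upper bound from a per-example oracle that picks the better
    prompt modality (text vs. visual) for each test image."""
    scores = [max(iou(t, g), iou(v, g))
              for t, v, g in zip(text_preds, visual_preds, gts)]
    return float(np.mean(scores))
```

Comparing `oracle_miou` against the mean IoU of either modality alone quantifies the headroom left by modality selection, which is what motivates combining both prompt types in PromptMatcher.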