TuneVLSeg: Prompt Tuning Benchmark for Vision-Language Segmentation Models

Vision-Language Models (VLMs) have shown impressive performance in vision tasks, but adapting them to new domains often requires expensive fine-tuning. Prompt tuning techniques, including textual, visual, and multimodal prompting, offer efficient alternatives by leveraging learnable prompts. However, their application to Vision-Language Segmentation Models (VLSMs) and their evaluation under significant domain shifts remain unexplored. This work presents an open-source benchmarking framework, TuneVLSeg, that integrates various unimodal and multimodal prompt tuning techniques into VLSMs, making prompt tuning usable for downstream segmentation datasets with any number of classes. TuneVLSeg includes $6$ prompt tuning strategies at various prompt depths, applied to $2$ VLSMs, for a total of $8$ different combinations. We evaluate these prompt tuning strategies on $8$ diverse medical datasets, including $3$ radiology datasets (breast tumor, echocardiography, chest X-ray pathologies) and $5$ non-radiology datasets (polyp, ulcer, skin cancer), and on $2$ natural-domain segmentation datasets. Our study finds that textual prompt tuning struggles under significant domain shifts from natural-domain images to medical data. Furthermore, visual prompt tuning, which has fewer hyperparameters than multimodal prompt tuning, often achieves performance competitive with multimodal approaches, making it a valuable first attempt. Our work advances the understanding and applicability of different prompt tuning techniques for robust domain-specific segmentation. The source code is available at https://github.com/naamiinepal/tunevlseg.
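
To make "learnable prompts" and "prompt depth" concrete, the following is a minimal sketch in plain Python. It is not TuneVLSeg's API: the names `make_prompts`, `frozen_layer`, and `encode_with_prompts` are hypothetical, and `frozen_layer` is a toy stand-in for a frozen transformer block rather than a real encoder. The sketch assumes the common deep-prompting scheme in which learnable tokens are prepended at the first layer and replaced with fresh learnable tokens at each subsequent layer up to the prompt depth, while the encoder weights stay fixed.

```python
import random


def make_prompts(num_prompts, dim, depth, seed=0):
    """Create `depth` sets of prompt vectors; these are the only trainable values."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_prompts)]
        for _ in range(depth)
    ]


def frozen_layer(tokens):
    """Toy stand-in for a frozen block: mixes each token with the sequence mean."""
    n, dim = len(tokens), len(tokens[0])
    mean = [sum(t[d] for t in tokens) / n for d in range(dim)]
    return [[0.5 * t[d] + 0.5 * mean[d] for d in range(dim)] for t in tokens]


def encode_with_prompts(tokens, prompts, num_layers):
    """Run tokens through frozen layers, injecting prompts at the first len(prompts) layers."""
    if not prompts:
        raise ValueError("prompt depth must be at least 1")
    num_prompts = len(prompts[0])
    for layer in range(num_layers):
        if layer == 0:
            # Prepend the first set of learnable prompts to the input tokens.
            tokens = prompts[0] + tokens
        elif layer < len(prompts):
            # Deep prompting: swap the prompt positions for this layer's own prompts.
            tokens = prompts[layer] + tokens[num_prompts:]
        # Beyond the prompt depth, prompt outputs simply propagate forward.
        tokens = frozen_layer(tokens)
    # Drop the prompt positions so the output has one vector per input token.
    return tokens[num_prompts:]


def count_trainable(prompts):
    """Number of trainable scalars: depth * num_prompts * dim."""
    return sum(len(vec) for layer in prompts for vec in layer)


# Example: 9 patch tokens of width 4, 2 prompts per layer, prompt depth 3 of 6 layers.
patches = [[float(i + d) for d in range(4)] for i in range(9)]
prompts = make_prompts(num_prompts=2, dim=4, depth=3)
features = encode_with_prompts(patches, prompts, num_layers=6)
print(len(features), count_trainable(prompts))
```

In this sketch the trainable parameter count grows only with the prompt depth, the number of prompts, and the embedding width, which is why prompt tuning is cheap compared with full fine-tuning. Multimodal prompting would, under the same assumptions, inject such prompts into both the text and image encoders.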
