Vision-language foundation models such as CLIP have shown impressive zero-shot performance on many tasks and datasets, owing in particular to their free-text inputs. However, they struggle with some downstream tasks, such as fine-grained attribute detection and localization. In this paper, we propose a multitask fine-tuning strategy based on a positive/negative prompt formulation to better exploit the capabilities of vision-language foundation models. Using the CLIP architecture as a baseline, we show strong improvements on fine-grained bird attribute detection and localization, while also improving classification performance on the CUB200-2011 dataset. For reproducibility, the source code is available at https://github.com/FactoDeepLearning/MultitaskVLFM.