With the emergence of pretrained vision-language models (VLMs), considerable effort has been devoted to fine-tuning them for downstream tasks. Despite progress in designing efficient fine-tuning methods, such methods require access to the model's parameters, which can be challenging because model owners often provide their models only as black boxes to safeguard ownership. This paper proposes a \textbf{C}ollabo\textbf{ra}tive \textbf{F}ine-\textbf{T}uning (\textbf{CraFT}) approach for adapting black-box VLMs to downstream tasks, where one has access only to the input prompts and the output predictions of the model. CraFT comprises two modules: a prompt generation module that learns text prompts, and a prediction refinement module that enhances output predictions in a residual style. We additionally introduce an auxiliary prediction-consistent loss to promote consistent optimization across the two modules, which are trained with a novel collaborative training algorithm. Extensive few-shot classification experiments on 15 datasets demonstrate the superiority of CraFT: it achieves a gain of about 12\% with 16-shot datasets and only 8,000 queries. Moreover, CraFT trains faster and uses only about 1/80 of the memory footprint for deployment, while sacrificing only 1.62\% in performance compared to the white-box method. Our code is publicly available at https://github.com/mrflogs/CraFT.
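To make the residual-style refinement and the prediction-consistent loss more concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the form of either component, so the linear correction in `refine`, the scaling factor `alpha`, and the KL-divergence form of `consistency_loss` are all illustrative assumptions, as are the function names.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def refine(logits, weight, bias, alpha=1.0):
    """Residual refinement (assumed form): black-box logits plus a learned correction.

    `weight` (C x C) and `bias` (C) are hypothetical trainable parameters of a
    linear refinement module; `alpha` scales the residual branch.
    """
    residual = [
        sum(w * x for w, x in zip(row, logits)) + b
        for row, b in zip(weight, bias)
    ]
    return [x + alpha * r for x, r in zip(logits, residual)]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def consistency_loss(prompt_logits, refined_logits):
    """Assumed consistency term: KL between refined and prompt-based predictions."""
    return kl_divergence(softmax(refined_logits), softmax(prompt_logits))


# Logits returned by the black-box model for one image over three classes.
black_box_logits = [2.0, 1.0, 0.1]

# A zero-initialized refinement module leaves the predictions unchanged.
zero_w = [[0.0] * 3 for _ in range(3)]
unchanged = refine(black_box_logits, zero_w, [0.0, 0.0, 0.0])

# A non-zero bias shifts the second class; the consistency term becomes positive.
shifted = refine(black_box_logits, zero_w, [0.0, 1.0, 0.0])
loss = consistency_loss(black_box_logits, shifted)
```

Under these assumptions, zero-initializing the residual branch means training starts from the black-box model's own predictions, which is the usual motivation for residual-style adapters.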