With the emergence of pretrained vision-language models (VLMs), considerable effort has been devoted to fine-tuning them for downstream tasks. Despite progress in designing efficient fine-tuning methods, such methods require access to the model's parameters, which can be challenging because model owners often provide their models only as a black box to safeguard ownership. This paper proposes a \textbf{C}ollabo\textbf{ra}tive \textbf{F}ine-\textbf{T}uning (\textbf{CraFT}) approach for fine-tuning black-box VLMs for downstream tasks, where one has access only to the input prompts and the output predictions of the model. CraFT comprises two modules: a prompt generation module for learning text prompts and a prediction refinement module for enhancing output predictions in a residual style. Additionally, we introduce an auxiliary prediction-consistent loss to promote consistent optimization across these modules. The modules are optimized by a novel collaborative training algorithm. Extensive experiments on few-shot classification over 15 datasets demonstrate the superiority of CraFT. The results show that CraFT achieves a gain of about 12\% with 16-shot datasets and only 8,000 queries. Moreover, CraFT trains faster and uses only about 1/80 of the memory footprint for deployment, while sacrificing only 1.62\% in performance, compared to the white-box method.
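To make the residual-style prediction refinement and the prediction-consistent loss concrete, the following is a minimal sketch and not the paper's implementation. The abstract does not specify the form of either component, so everything here is an assumption for illustration: the names `refine`, `consistency_loss`, and `alpha`, the linear correction, and the use of a KL divergence. The prompt generation module and the collaborative training algorithm are not sketched.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def refine(black_box_logits, weights, bias, alpha=0.5):
    """Residual refinement: black-box logits plus a scaled learned correction.

    The linear correction (weights, bias) and the scale alpha are
    hypothetical choices, not taken from the paper.
    """
    correction = [
        sum(w * x for w, x in zip(row, black_box_logits)) + b
        for row, b in zip(weights, bias)
    ]
    return [x + alpha * c for x, c in zip(black_box_logits, correction)]


def consistency_loss(p_logits, q_logits):
    """KL(softmax(p) || softmax(q)), an assumed stand-in for the paper's loss."""
    p, q = softmax(p_logits), softmax(q_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical black-box output for one image over three classes.
logits = [2.0, 1.0, 0.1]
zero_w = [[0.0] * 3 for _ in range(3)]

# With a zero correction the refined prediction equals the black-box one.
unchanged = refine(logits, zero_w, [0.0, 0.0, 0.0])

# A nonzero bias shifts the first class; the loss measures the disagreement.
shifted = refine(logits, zero_w, [1.0, 0.0, 0.0])
disagreement = consistency_loss(logits, shifted)
```

In this sketch the refinement only consumes the black-box output, which matches the setting where nothing but input prompts and output predictions is available. Initializing the correction at zero would start training from the unmodified black-box prediction.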