Finetuning a pretrained vision model (PVM) is a common technique for learning downstream vision tasks. However, conventional finetuning on randomly sampled data points reduces training efficiency. To address this drawback, we propose a novel approach, VLM-empowered Collaborative Active Finetuning (VeCAF). VeCAF optimizes a parametric data selection model by incorporating the training objective of the model being tuned. This guides the PVM towards the performance goal with improved data and computational efficiency. Because vision-language models (VLMs) have made significant advances by establishing a robust connection between the image and language domains, we exploit the inherent semantic richness of the text embedding space and use the text embeddings of pretrained VLMs to augment PVM image features for better data selection and finetuning. Furthermore, the flexibility of text-domain augmentation gives VeCAF a unique ability to handle out-of-distribution scenarios without external augmented data. Extensive experiments show that VeCAF outperforms baselines in both accuracy and efficiency on in-distribution and out-of-distribution image classification tasks. On ImageNet, VeCAF needs up to 3.3x fewer training batches than full finetuning to reach the target performance, and it achieves a 2.8% accuracy improvement over state-of-the-art (SOTA) methods with the same number of batches.
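To make the two ideas in the abstract concrete (augmenting image features with text embeddings, and letting the training loss inform data selection), the following is a minimal illustrative sketch, not the VeCAF algorithm itself. The abstract describes optimizing a parametric selection model; this sketch swaps in a much simpler heuristic, a loss-weighted greedy k-center selection, purely for illustration. The function names `augment_with_text` and `select_loss_aware`, the mixing weight `alpha`, and the additive fusion rule are all assumptions that do not come from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def augment_with_text(image_feat, text_embed, alpha=0.5):
    """Fuse an image feature with a text embedding of the same dimension.

    Assumption: a simple additive mix of unit vectors, controlled by the
    hypothetical weight `alpha`. The fusion used in the paper may differ.
    """
    img = l2_normalize(image_feat)
    txt = l2_normalize(text_embed)
    return l2_normalize([i + alpha * t for i, t in zip(img, txt)])


def select_loss_aware(feats, losses, budget):
    """Pick `budget` sample indices with a loss-weighted greedy k-center heuristic.

    Each step picks the unselected sample that maximizes
    (distance to the nearest selected sample) * (its current loss),
    so the selection favors samples that are both novel and hard.
    """
    n = len(feats)
    if not 0 < budget <= n:
        raise ValueError("budget must be between 1 and the number of samples")

    # Start from the sample with the highest loss.
    first = max(range(n), key=lambda i: losses[i])
    selected = [first]
    min_dist = [math.dist(f, feats[first]) for f in feats]

    while len(selected) < budget:
        chosen = set(selected)
        candidates = [i for i in range(n) if i not in chosen]
        nxt = max(candidates, key=lambda i: min_dist[i] * losses[i])
        selected.append(nxt)
        # Update each sample's distance to its nearest selected sample.
        min_dist = [min(d, math.dist(f, feats[nxt])) for d, f in zip(min_dist, feats)]
    return selected


# Toy usage with made-up 2-D features, text embeddings, and per-sample losses.
image_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
text_embeds = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
losses = [0.2, 0.1, 0.9, 0.5]

feats = [augment_with_text(f, t) for f, t in zip(image_feats, text_embeds)]
picked = select_loss_aware(feats, losses, budget=2)
print(picked)
```

In a real pipeline the features would come from the PVM, the text embeddings from a pretrained VLM text encoder, and the losses from the model currently being finetuned. All three are replaced by toy values here.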