Vision and Language Models (VLMs), such as CLIP, enable visual recognition of a potentially unlimited set of categories described by text prompts. For the best visual recognition performance, however, these models still require tuning to the data distribution of the downstream task in order to overcome the domain shift from their web-based pre-training data. Recent work has shown that VLMs can be tuned effectively without any paired data, and in particular that their visual recognition performance can be improved using text-only training data generated by Large Language Models (LLMs). In this paper, we dive deeper into this exciting text-only VLM training approach and explore how it can be significantly improved by taking the specifics of the downstream task into account when sampling text data from LLMs. Compared to the SOTA text-only VLM training approach, we demonstrate up to 8.4% performance improvement in (cross-)domain-specific adaptation, up to 8.7% improvement in fine-grained recognition, and a 3.1% overall average improvement in zero-shot classification compared to strong baselines.