Contrastive Language-Image Pretraining (CLIP) models capture the semantic relationship between images and text and have enabled a wide range of applications, from image retrieval to classification. These models are trained on datasets extracted from web crawls, which are large in quantity but limited in quality. This paper explores whether limited amounts of higher-quality data in a specific domain improve the general performance of CLIP models. To this end, we extract text-image data from scientific papers hosted in the arXiv and PubMed Central repositories. Experiments on small-scale CLIP models (ViT B/32) show that model performance increases on average, but only moderately. This result indicates that using the data sources considered in the paper to train large-scale CLIP models is a worthwhile research direction.