Recent advances in multimodal image pretraining show that visual models trained with semantically dense textual supervision tend to generalize better than those trained on categorical attributes or with unsupervised techniques. Motivated by this, we investigate how the recent CLIP model can be applied to several tasks in the artwork domain. We perform exhaustive experiments on the NoisyArt dataset, a collection of artwork images crawled from public resources on the web. On this dataset, CLIP achieves impressive results on (zero-shot) classification and promising results on both artwork-to-artwork and description-to-artwork retrieval.
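As a rough illustration of the two task families mentioned above, the following minimal sketch shows how zero-shot classification and retrieval are commonly done with a CLIP-style joint embedding space: images and texts are mapped to vectors, and cosine similarity drives both the class prediction and the ranking. It is not the pipeline used in this work. The toy three-dimensional embeddings, the class names (`painting_a`, `sculpture_b`, `fresco_c`), the function names, and the temperature value of 100 are all assumptions made for illustration; in practice the vectors would come from CLIP's image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_emb, prompt_embs, temperature=100.0):
    """Pick the class whose text-prompt embedding is closest to the image.

    prompt_embs maps a class name to the embedding of its text prompt.
    The temperature is an assumed scaling factor applied before softmax.
    """
    labels = list(prompt_embs)
    logits = [temperature * cosine(image_emb, prompt_embs[name]) for name in labels]
    # Numerically stable softmax over the scaled similarities.
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    probs = {name: e / total for name, e in zip(labels, exps)}
    best = max(probs, key=probs.get)
    return best, probs


def retrieve(query_emb, gallery_embs, k=2):
    """Rank gallery items by cosine similarity to the query embedding.

    The query may be an image embedding (artwork-to-artwork) or a text
    embedding (description-to-artwork); the ranking logic is identical.
    """
    ranked = sorted(
        gallery_embs,
        key=lambda name: cosine(query_emb, gallery_embs[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy embeddings standing in for CLIP encoder outputs.
toy_embs = {
    "painting_a": [0.9, 0.1, 0.0],
    "sculpture_b": [0.1, 0.9, 0.2],
    "fresco_c": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.1]

predicted, class_probs = zero_shot_classify(toy_query, toy_embs)
top_matches = retrieve(toy_query, toy_embs, k=2)
```

Because classification and retrieval share the same similarity function, the only difference in this sketch is whether the candidate set holds text-prompt embeddings (one per class) or embeddings of gallery items.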