Image generators are rapidly gaining popularity and have changed how digital content is created. With the latest AI technology, the public is generating millions of high-quality images, which continually motivates the research community to push the limits of generative models toward more complex and realistic images. This paper focuses on Cross-Domain Image Retrieval (CDIR), which can serve as an additional tool for inspecting collections of generated images by determining the level of similarity between images in a dataset. An ideal retrieval system would generalize to unseen complex images from multiple domains (e.g., photos, drawings, and paintings). To address this goal, we propose a novel caption-matching approach that leverages multimodal language-vision architectures pre-trained on large datasets. The method is tested on the DomainNet and Office-Home datasets and consistently achieves state-of-the-art performance, outperforming the latest approaches in the literature for cross-domain image retrieval. To verify its effectiveness on AI-generated images, the method was also tested on a database of samples collected from Midjourney, a widely used generative platform for content creation.
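The abstract does not spell out how captions are produced or compared, so the following is only a minimal sketch of the general caption-matching idea, under stated assumptions: images and candidate captions are taken to live in a shared embedding space (as a pre-trained language-vision model would provide), each image is assigned its most similar caption, and a query retrieves the gallery images that share its caption regardless of domain. The embeddings are hard-coded toy vectors, and the names `best_caption` and `retrieve` are hypothetical, not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def best_caption(image_vec, caption_vecs):
    """Return the caption whose embedding is closest to the image embedding."""
    return max(caption_vecs, key=lambda c: cosine(image_vec, caption_vecs[c]))


def retrieve(query_vec, gallery, caption_vecs):
    """Return the query's caption and the gallery images matched to the same
    caption, ranked by their similarity to that caption (highest first)."""
    target = best_caption(query_vec, caption_vecs)
    hits = [
        (cosine(vec, caption_vecs[target]), name)
        for name, vec in gallery.items()
        if best_caption(vec, caption_vecs) == target
    ]
    return target, [name for _, name in sorted(hits, reverse=True)]


# Toy stand-ins for embeddings (assumption: a real system would obtain these
# from the image and text encoders of a pre-trained language-vision model).
caption_vecs = {
    "a dog": [1.0, 0.0, 0.1],
    "a bicycle": [0.0, 1.0, 0.1],
}
gallery = {
    "dog_photo": [0.9, 0.1, 0.3],
    "dog_sketch": [0.8, 0.2, -0.4],
    "bike_painting": [0.1, 0.9, 0.2],
}
query = [0.7, 0.1, 0.0]  # e.g., a drawing of a dog from an unseen domain

caption, matches = retrieve(query, gallery, caption_vecs)
print(caption, matches)
```

Matching through captions rather than comparing image features directly is what lets a photo and a sketch of the same object be retrieved together in this toy setting; whether the paper ranks results this way is not stated in the abstract.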