We present a technique for segmenting real and AI-generated images using latent diffusion models (LDMs) trained on internet-scale datasets. First, we show that the latent space of LDMs (z-space) is a better input representation for text-based image segmentation than alternatives such as RGB images or CLIP encodings. Because the z-space is a compressed representation shared across domains such as different forms of art, cartoons, illustrations, and photographs, training segmentation models on it also bridges the domain gap between real and AI-generated images. We then show that the internal features of LDMs contain rich semantic information, and we present a technique, LD-ZNet, that further boosts the performance of text-based segmentation. Overall, we show up to 6% improvement over standard baselines for text-to-image segmentation on natural images. For AI-generated imagery, we show close to 20% improvement over state-of-the-art techniques.
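
To make the "compressed representation" of the z-space concrete, the short sketch below computes the latent shape and the resulting compression ratio for an RGB image. It assumes an autoencoder that downsamples each spatial side by a factor of 8 and produces 4 latent channels. This configuration is common among publicly released LDMs but is not stated in the text above, so both values, along with the function names `latent_shape` and `compression_ratio`, are illustrative assumptions rather than the authors' settings.

```python
# Illustrative arithmetic only: the downsampling factor (8) and the number of
# latent channels (4) are ASSUMPTIONS, not values given in the abstract.

def latent_shape(height, width, downsample=8, latent_channels=4):
    """Return (channels, height, width) of the z-space tensor for an RGB image."""
    if height % downsample or width % downsample:
        raise ValueError("image sides must be divisible by the downsampling factor")
    return (latent_channels, height // downsample, width // downsample)


def compression_ratio(height, width, downsample=8, latent_channels=4):
    """Ratio of RGB values to latent values for one image."""
    channels, z_height, z_width = latent_shape(height, width, downsample, latent_channels)
    rgb_values = 3 * height * width
    latent_values = channels * z_height * z_width
    return rgb_values / latent_values


if __name__ == "__main__":
    # A 512x512 RGB image under the assumed configuration.
    print(latent_shape(512, 512))
    print(compression_ratio(512, 512))
```

Under these assumptions, a segmentation model trained on the z-space processes far fewer input values per image than one trained on RGB pixels, which is one way to read the compactness claim made above.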