Text-to-image diffusion models have made significant advances in generating and editing high-quality images. As a result, numerous approaches have explored the ability of diffusion model features to understand and process single images for downstream tasks, e.g., classification, semantic segmentation, and stylization. However, significantly less is known about what these features reveal across multiple, different images and objects. In this work, we exploit Stable Diffusion (SD) features for semantic and dense correspondence and discover that, with simple post-processing, SD features can perform quantitatively on par with state-of-the-art (SOTA) representations. Interestingly, our qualitative analysis reveals that SD features have very different properties from existing representation learning features, such as the recently released DINOv2: while DINOv2 provides sparse but accurate matches, SD features provide high-quality spatial information but sometimes inaccurate semantic matches. We demonstrate that a simple fusion of these two features works surprisingly well, and that a zero-shot evaluation using nearest neighbors on the fused features provides a significant performance gain over SOTA methods on benchmark datasets, e.g., SPair-71k, PF-Pascal, and TSS. We also show that these correspondences can enable interesting applications such as instance swapping between two images.
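To make the fusion and zero-shot nearest-neighbor matching concrete, the following is a minimal sketch of one plausible reading of the idea. The abstract says only that the fusion is "simple"; the per-descriptor L2 normalization, the weighted concatenation, the weight `alpha`, and the function names `fuse` and `nearest_neighbors` are all assumptions made for illustration, not the authors' implementation. Plain Python lists stand in for the per-location SD and DINOv2 descriptors, which in practice would be extracted from the respective networks.

```python
import math


def l2_normalize(vec, eps=1e-8):
    """Scale a descriptor to (approximately) unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]


def fuse(sd_vec, dino_vec, alpha=0.5):
    """Fuse one SD descriptor and one DINOv2 descriptor for the same location.

    Assumption: each descriptor is L2-normalized so that neither dominates,
    then the two are concatenated with weights alpha and (1 - alpha).
    `alpha` is a hypothetical balancing parameter.
    """
    sd = [alpha * v for v in l2_normalize(sd_vec)]
    dino = [(1.0 - alpha) * v for v in l2_normalize(dino_vec)]
    return sd + dino


def nearest_neighbors(src_feats, tgt_feats):
    """For each source descriptor, return the index of the most similar
    target descriptor under cosine similarity (no training involved)."""
    src = [l2_normalize(f) for f in src_feats]
    tgt = [l2_normalize(f) for f in tgt_feats]
    matches = []
    for s in src:
        sims = [sum(a * b for a, b in zip(s, t)) for t in tgt]
        matches.append(max(range(len(tgt)), key=sims.__getitem__))
    return matches


# Toy usage: three locations with 2-D "SD" and 3-D "DINOv2" descriptors.
sd_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dino_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
source = [fuse(s, d) for s, d in zip(sd_feats, dino_feats)]
target = source[::-1]  # same descriptors in reversed order
print(nearest_neighbors(source, target))
```

In this toy setup the target is simply the source in reversed order, so each source location should match its own copy. On real feature maps the same argmax over cosine similarities would be taken per pixel or per patch.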