Finding correspondences between images is a fundamental problem in computer vision. In this paper, we show that correspondence emerges in image diffusion models without any explicit supervision. We propose a simple strategy for extracting this implicit knowledge from diffusion networks as image features, which we call DIffusion FeaTures (DIFT), and use them to establish correspondences between real images. Without any additional fine-tuning or supervision on task-specific data or annotations, DIFT outperforms both weakly-supervised methods and competitive off-the-shelf features in identifying semantic, geometric, and temporal correspondences. For semantic correspondence in particular, DIFT from Stable Diffusion outperforms DINO and OpenCLIP by 19 and 14 accuracy points, respectively, on the challenging SPair-71k benchmark. It even outperforms state-of-the-art supervised methods on 9 out of 18 categories while remaining on par in overall performance. Project page: https://diffusionfeatures.github.io
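As an illustration of how per-pixel features can be used to establish a correspondence, the following is a minimal sketch that matches a query location in a source feature map to its most similar location in a target feature map. It assumes the features have already been extracted as arrays of shape (channels, height, width). The extraction from a diffusion model is not shown, and the use of cosine-similarity nearest-neighbor matching is an assumption made for this example, since the abstract does not specify the matching rule. The function name `match_point` and the toy data are hypothetical.

```python
import numpy as np


def match_point(feat_src, feat_tgt, query_yx):
    """Return the (row, col) in feat_tgt whose feature vector is most
    similar (cosine similarity) to the feature at query_yx in feat_src.

    feat_src, feat_tgt: arrays of shape (channels, height, width).
    NOTE: cosine nearest-neighbor matching is an assumption of this sketch.
    """
    c, h, w = feat_tgt.shape
    y, x = query_yx

    # Unit-normalize the query feature vector.
    q = feat_src[:, y, x]
    q = q / (np.linalg.norm(q) + 1e-8)

    # Flatten the target map to (channels, h*w) and unit-normalize each column.
    tgt = feat_tgt.reshape(c, -1)
    tgt = tgt / (np.linalg.norm(tgt, axis=0, keepdims=True) + 1e-8)

    # Cosine similarity between the query and every target location.
    sim = q @ tgt
    idx = int(np.argmax(sim))

    # Convert the flat index back to (row, col).
    return divmod(idx, w)


# Toy stand-in for real features: the target map is the source map
# shifted by 2 rows and 3 columns (with wrap-around).
rng = np.random.default_rng(0)
feat_src = rng.standard_normal((16, 8, 8))
feat_tgt = np.roll(feat_src, shift=(2, 3), axis=(1, 2))

print(match_point(feat_src, feat_tgt, (1, 4)))  # (3, 7)
```

In this toy setup the query at (1, 4) is recovered at (3, 7), which is exactly the applied shift. With real features, the two maps would come from different images and the match would be approximate.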