We present Neural Congealing -- a zero-shot self-supervised framework for detecting and jointly aligning semantically common content across a given set of images. Our approach harnesses the power of pre-trained DINO-ViT features to learn: (i) a joint semantic atlas -- a 2D grid that captures the mode of DINO-ViT features in the input set, and (ii) dense mappings from the unified atlas to each of the input images. We derive a new robust self-supervised framework that optimizes the atlas representation and mappings per image set, requiring only a few real-world images as input without any additional input information (e.g., segmentation masks). Notably, we design our losses and training paradigm to account only for the shared content under severe variations in appearance, pose, background clutter or other distracting objects. We demonstrate results on a wide range of challenging image sets including sets of mixed domains (e.g., aligning images depicting sculpture and artwork of cats), sets depicting related yet different object categories (e.g., dogs and tigers), or domains for which large-scale training data is scarce (e.g., coffee mugs). We thoroughly evaluate our method and show that our test-time optimization approach performs favorably compared to a state-of-the-art method that requires extensive training on large-scale datasets.