We present an automated technique for computing a map between two genus-zero shapes that matches semantically corresponding regions to one another. The lack of annotated data prohibits direct inference of 3D semantic priors; instead, current state-of-the-art methods predominantly optimize geometric properties or require varying amounts of manual annotation. To overcome the lack of annotated training data, we distill semantic matches from pre-trained vision models: our method renders the pair of 3D shapes from multiple viewpoints, and the resulting renders are fed into an off-the-shelf image-matching method that leverages a pre-trained visual model to produce feature points. This yields semantic correspondences that can be projected back onto the 3D shapes, producing a raw matching that is inaccurate and inconsistent across viewpoints. These correspondences are refined and distilled into an inter-surface map by a dedicated optimization scheme that promotes bijectivity and continuity of the output map. We show that our approach can generate semantic surface-to-surface maps without manual annotations or any 3D training data. Furthermore, it proves effective both in scenarios with high semantic complexity, where objects are non-isometrically related, and in situations where they are nearly isometric.
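The back-projection step can be illustrated with a minimal sketch. It is not the implementation described here: it assumes each render comes with a hypothetical per-pixel "id buffer" recording which mesh vertex is visible at that pixel (-1 for background), and that the image matcher returns pairs of pixel coordinates. The names `lift_matches`, `collect_votes`, and `inconsistent_vertices` are illustrative only. The sketch lifts 2D matches to vertex pairs and counts how often viewpoints disagree, which is the kind of raw, inconsistent matching that the optimization stage is meant to refine.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Pixel = Tuple[int, int]      # (row, col) in a rendered image
Match = Tuple[Pixel, Pixel]  # pixel in a render of shape A, pixel in a render of shape B
IdBuffer = List[List[int]]   # assumed: visible vertex id per pixel, -1 for background
View = Tuple[List[Match], IdBuffer, IdBuffer]

BACKGROUND = -1


def lift_matches(matches: List[Match], ids_a: IdBuffer, ids_b: IdBuffer) -> List[Tuple[int, int]]:
    """Project the 2D matches of one viewpoint back onto the two surfaces."""
    lifted = []
    for (row_a, col_a), (row_b, col_b) in matches:
        vertex_a = ids_a[row_a][col_a]
        vertex_b = ids_b[row_b][col_b]
        # Discard matches where either pixel falls on the background.
        if vertex_a != BACKGROUND and vertex_b != BACKGROUND:
            lifted.append((vertex_a, vertex_b))
    return lifted


def collect_votes(views: List[View]) -> Dict[int, Counter]:
    """Accumulate, per vertex of A, how often each vertex of B was matched to it."""
    votes: Dict[int, Counter] = defaultdict(Counter)
    for matches, ids_a, ids_b in views:
        for vertex_a, vertex_b in lift_matches(matches, ids_a, ids_b):
            votes[vertex_a][vertex_b] += 1
    return dict(votes)


def inconsistent_vertices(votes: Dict[int, Counter]) -> List[int]:
    """Vertices of A that different matches send to different vertices of B."""
    return sorted(v for v, counter in votes.items() if len(counter) > 1)


# Toy data: two viewpoints with tiny id buffers (purely illustrative).
view_1: View = (
    [((0, 0), (0, 0)), ((0, 1), (0, 1)), ((1, 0), (1, 0))],
    [[0, 1], [-1, 2]],
    [[5, 6], [7, -1]],
)
view_2: View = (
    [((0, 0), (0, 0)), ((0, 1), (0, 1))],
    [[0, 1]],
    [[5, 8]],
)

raw_votes = collect_votes([view_1, view_2])
print(inconsistent_vertices(raw_votes))
```

In this toy input both viewpoints agree on vertex 0 of shape A but disagree on vertex 1, so vertex 1 is reported as inconsistent. Simple vote counting only exposes such conflicts; it does not by itself produce a bijective or continuous map.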