We present an automated technique for computing a map between two genus-zero shapes that matches semantically corresponding regions to one another. The lack of annotated data precludes direct inference of 3D semantic priors; instead, current state-of-the-art methods predominantly optimize geometric properties or require varying amounts of manual annotation. To overcome the lack of annotated training data, we distill semantic matches from pre-trained vision models: our method renders the pair of 3D shapes from multiple viewpoints, and the resulting renders are fed into an off-the-shelf image-matching method that leverages a pre-trained visual model to produce feature points. This yields semantic correspondences that can be projected back to the 3D shapes, producing a raw matching that is inaccurate and inconsistent across viewpoints. These correspondences are refined and distilled into an inter-surface map by a dedicated optimization scheme that promotes bijectivity and continuity of the output map. We show that our approach can generate semantic surface-to-surface maps without requiring manual annotation or any 3D training data. Furthermore, it proves effective both in scenarios with high semantic complexity, where objects are non-isometrically related, and in situations where they are nearly isometric.
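The step of projecting 2D image matches back onto the 3D shapes can be illustrated with a small sketch. It is not the authors' implementation: it assumes a pinhole camera, a per-view z-depth map saved by the renderer, and known camera-to-world poses, none of which are specified in the abstract. The names `unproject` and `lift_matches` are hypothetical.

```python
import numpy as np


def unproject(pixels, depth, K, cam_to_world):
    """Lift (N, 2) pixel coordinates (u, v) to (N, 3) world-space points.

    Assumes a pinhole camera with 3x3 intrinsics K, a z-depth map indexed
    as depth[row, col], and a 4x4 rigid camera-to-world transform.
    """
    pixels = np.asarray(pixels, dtype=float)
    u, v = pixels[:, 0], pixels[:, 1]
    # Look up the z-depth at each pixel (row = v, column = u).
    z = depth[v.astype(int), u.astype(int)]
    # Homogeneous pixel coordinates, one column per pixel.
    homog = np.stack([u, v, np.ones_like(u)], axis=0)
    # Back-project through the inverse intrinsics and scale by depth.
    cam_pts = (np.linalg.inv(K) @ homog) * z
    # Apply the rigid camera-to-world transform.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return (R @ cam_pts).T + t


def lift_matches(matches, depth_a, depth_b, K, pose_a, pose_b):
    """Turn (N, 4) image matches (u_a, v_a, u_b, v_b) into paired 3D points."""
    matches = np.asarray(matches, dtype=float)
    pts_a = unproject(matches[:, :2], depth_a, K, pose_a)
    pts_b = unproject(matches[:, 2:], depth_b, K, pose_b)
    return pts_a, pts_b
```

Repeating this for every rendered view pair would give a raw, possibly inconsistent set of 3D correspondences of the kind the abstract describes as input to the refinement stage. Filtering of background pixels and snapping the lifted points to mesh vertices or faces are omitted here.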