Semantic correspondence has made tremendous progress through recent advances in large vision models (LVMs). While these LVMs have been shown to reliably capture local semantics, they do not yet reliably capture the global geometric relationships between semantic object regions. As a result, semantic correspondence remains unreliable between images with extreme viewpoint variation. In this work, we leverage monocular depth estimates to capture these geometric relationships for more robust and data-efficient semantic correspondence. First, we introduce a simple but effective method to build 3D object-class representations from monocular depth estimates and LVM features using a sparsely annotated image correspondence dataset. Second, we formulate an alignment energy that can be minimized using gradient descent to obtain an alignment between the 3D object-class representation and the object-class instance in the input RGB image. Our method achieves state-of-the-art matching accuracy in multiple categories on the challenging SPair-71k dataset, increasing the PCK@0.1 score by more than 10 points on three categories and overall by 3.3 points, from 85.6% to 88.9%. Additional resources and code are available at https://dub.sh/semalign3d.
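To make the second step concrete, below is a minimal, illustrative sketch of minimizing an alignment energy via gradient descent. The abstract does not specify the energy, so everything here is an assumption: we hypothetically model the 3D object-class representation as anchor points with LVM feature vectors, lift the instance to 3D using monocular depth, and use a feature-weighted sum of squared distances over a similarity transform (scale, rotation, translation). This is not the paper's actual formulation.

```python
# Hypothetical sketch, NOT the paper's method: gradient-descent
# minimization of an assumed alignment energy between a 3D object-class
# representation and an instance lifted to 3D via monocular depth.
import torch

def axis_angle_to_matrix(omega: torch.Tensor) -> torch.Tensor:
    """Rodrigues' formula: axis-angle vector -> 3x3 rotation matrix."""
    theta = omega.norm()
    k = omega / theta
    zero = torch.zeros(1)
    K = torch.stack([  # skew-symmetric cross-product matrix of k
        torch.cat([zero, -k[2:3], k[1:2]]),
        torch.cat([k[2:3], zero, -k[0:1]]),
        torch.cat([-k[1:2], k[0:1], zero]),
    ])
    return torch.eye(3) + torch.sin(theta) * K + (1 - torch.cos(theta)) * (K @ K)

def align(class_pts, class_feats, inst_pts, inst_feats,
          steps=500, lr=1e-2, tau=0.1):
    """Minimize an (assumed) alignment energy over a similarity transform.

    class_pts:   (N, 3) anchor points of the 3D object-class representation
    class_feats: (N, D) LVM features of the anchors (assumed L2-normalized)
    inst_pts:    (M, 3) instance points from pixels + monocular depth
    inst_feats:  (M, D) LVM features at the instance points
    """
    log_s = torch.zeros(1, requires_grad=True)        # log scale
    omega = (1e-2 * torch.randn(3)).requires_grad_()  # axis-angle rotation
    t = torch.zeros(3, requires_grad=True)            # translation
    opt = torch.optim.Adam([log_s, omega, t], lr=lr)

    # Soft correspondences from feature similarity (held fixed here).
    w = torch.softmax(class_feats @ inst_feats.T / tau, dim=-1)  # (N, M)

    for _ in range(steps):
        opt.zero_grad()
        R = axis_angle_to_matrix(omega)
        aligned = torch.exp(log_s) * class_pts @ R.T + t  # (N, 3)
        # Energy: feature-weighted squared distances between transformed
        # class anchors and instance points.
        d2 = ((aligned[:, None, :] - inst_pts[None, :, :]) ** 2).sum(-1)
        energy = (w * d2).sum()
        energy.backward()
        opt.step()
    return torch.exp(log_s).detach(), axis_angle_to_matrix(omega).detach(), t.detach()
```

Given the fitted transform, 2D correspondences could then be read off by projecting the transformed anchor points back into the image; the paper's actual energy and parameterization may differ from this sketch.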