With the immense growth of dataset sizes and computing resources in recent years, so-called foundation models have become popular in NLP and vision tasks. In this work, we propose to explore foundation models for the task of keypoint detection on 3D shapes. A unique characteristic of keypoint detection is that it requires semantic and geometric awareness while demanding high localization accuracy. To address this problem, we propose, first, to back-project features from large pre-trained 2D vision models onto 3D shapes and employ them for this task. We show that we obtain robust 3D features that contain rich semantic information and analyze multiple candidate features stemming from different 2D foundation models. Second, we employ a keypoint candidate optimization module which aims to match the average observed distribution of keypoints on the shape and is guided by the back-projected features. The resulting approach achieves a new state of the art for few-shot keypoint detection on the KeyPointNet dataset, almost doubling the performance of the previous best methods.
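The core operation described above can be illustrated with a minimal sketch: given per-pixel feature maps extracted from rendered views by a 2D backbone, each 3D point is projected into every view and the sampled features are averaged. All names, shapes, and the nearest-pixel sampling are illustrative assumptions, not the paper's actual implementation (which may use different sampling, visibility handling, or aggregation):

```python
import numpy as np

def backproject_features(points, feat_maps, K, extrinsics):
    """Aggregate 2D features onto 3D points by averaging over views (sketch).

    points:     (N, 3) 3D point coordinates in world frame
    feat_maps:  list of (H, W, C) per-pixel feature maps, one per rendered view
    K:          (3, 3) camera intrinsics, shared across views for simplicity
    extrinsics: list of (4, 4) world-to-camera transforms, one per view
    Returns:    (N, C) averaged per-point features
    """
    n_pts = points.shape[0]
    n_ch = feat_maps[0].shape[2]
    accum = np.zeros((n_pts, n_ch))
    counts = np.zeros(n_pts)
    homog = np.hstack([points, np.ones((n_pts, 1))])       # (N, 4)
    for feat, T in zip(feat_maps, extrinsics):
        height, width, _ = feat.shape
        cam = (T @ homog.T).T[:, :3]                       # points in camera frame
        in_front = cam[:, 2] > 1e-6                        # keep points before the camera
        uv = (K @ cam.T).T
        uv = uv[:, :2] / uv[:, 2:3]                        # perspective divide
        u = np.round(uv[:, 0]).astype(int)                 # nearest-pixel sampling
        v = np.round(uv[:, 1]).astype(int)
        valid = in_front & (u >= 0) & (u < width) & (v >= 0) & (v < height)
        accum[valid] += feat[v[valid], u[valid]]
        counts[valid] += 1
    return accum / np.maximum(counts, 1)[:, None]          # avoid division by zero
```

Note that this sketch omits occlusion reasoning: a point hidden behind the surface in some view would still pick up features from that view, which a real pipeline would typically mask out using rendered depth.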