Single-view 3D shape retrieval is a fundamental yet challenging task that is becoming increasingly important as the amount of available 3D data grows. Existing approaches largely fall into two categories: those that use contrastive learning to map point cloud features into existing vision-language spaces, and those that learn a common embedding space for 2D images and 3D shapes. However, these feed-forward, holistic alignments are often difficult to interpret, which in turn limits their robustness and their generalization to real-world applications. To address this problem, we propose Pose-Aware 3D Shape Retrieval (PASR), a framework that formulates retrieval as a feature-level analysis-by-synthesis problem by distilling knowledge from a 2D foundation model (DINOv3) into a 3D encoder. By aligning pose-conditioned 3D projections with 2D feature maps, our method bridges the gap between real-world images and synthetic meshes. During inference, PASR performs test-time optimization via analysis-by-synthesis, jointly searching for the shape and pose that best reconstruct the patch-level feature map of the input image. This synthesis-based optimization is inherently robust to partial occlusion and sensitive to fine-grained geometric details. PASR outperforms existing methods by a wide margin on both clean and occluded 3D shape retrieval datasets. Additionally, PASR demonstrates strong multi-task capabilities, achieving robust shape retrieval, competitive pose estimation, and accurate category classification within a single framework.
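The joint shape-and-pose search described above can be illustrated with a minimal sketch. This is not the PASR implementation: the function names, the precomputed feature bank, the exhaustive grid search over discrete poses (in place of whatever optimizer the method actually uses), and the similarity floor used to limit the influence of occluded patches are all assumptions made for illustration, and the features are random stand-ins rather than DINOv3 outputs.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def retrieve_shape_and_pose(image_feats, bank, outlier_sim=0.0):
    """Hypothetical joint search over shapes and poses.

    image_feats: (N, C) patch-level features of the query image.
    bank:        (S, P, N, C) synthesized patch features for S candidate
                 shapes, each rendered under P candidate poses (assumed
                 to be precomputed by some pose-conditioned 3D encoder).
    outlier_sim: assumed floor on per-patch similarity, so a mismatched
                 (e.g. occluded) patch contributes a constant instead of
                 a large penalty.
    Returns (best_shape, best_pose, scores) with scores of shape (S, P).
    """
    q = l2_normalize(image_feats)              # (N, C)
    b = l2_normalize(bank)                     # (S, P, N, C)
    sim = np.einsum("spnc,nc->spn", b, q)      # per-patch cosine similarity
    sim = np.maximum(sim, outlier_sim)         # robust floor for outliers
    scores = sim.mean(axis=-1)                 # (S, P) reconstruction score
    s, p = np.unravel_index(np.argmax(scores), scores.shape)
    return int(s), int(p), scores

# Toy data: 4 shapes, 6 poses, 16 patches, 8-dimensional features.
rng = np.random.default_rng(0)
bank = rng.normal(size=(4, 6, 16, 8))

# Query = shape 2 under pose 3, with small noise and 4 "occluded" patches
# whose features are replaced by unrelated random vectors.
query = bank[2, 3] + 0.05 * rng.normal(size=(16, 8))
query[:4] = rng.normal(size=(4, 8))

best_shape, best_pose, scores = retrieve_shape_and_pose(query, bank)
print(best_shape, best_pose)
```

Scoring per patch rather than with a single global embedding is what lets the unoccluded patches dominate the decision in this toy setting; a practical system would presumably replace the grid search with a finer or gradient-based pose refinement.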