Most existing 3D keypoint estimation methods rely on manual annotations or calibrated multi-view images, both of which are expensive to collect. This paper introduces KeyDiff3D, a framework that accurately predicts 3D keypoints from a single image, eliminating the need for such costly data acquisition. To achieve this, we leverage the powerful geometric priors embedded in a pretrained multi-view diffusion model. In our framework, the diffusion model generates multi-view images from a single input image, and these images serve as supervision signals that provide 3D geometric cues to our model. We also introduce a 3D feature extractor that transforms the implicit 3D priors embedded in the diffusion features into explicit 3D feature volumes. Beyond accurate keypoint estimation, we further introduce a pipeline that enables manipulation of 3D objects generated by the diffusion model. Experimental results on diverse datasets, including Human3.6M, CUB-200-2011, and Stanford Dogs, as well as several in-the-wild and out-of-domain inputs, demonstrate the effectiveness of our method in terms of accuracy, generalization, and its ability to enable manipulation of diffusion-generated 3D objects from a single image.