Estimating 3D articulated shapes such as animal bodies from monocular images is inherently challenging because camera viewpoint, pose, texture, and lighting, among other factors, are ambiguous. We propose ARTIC3D, a self-supervised framework that reconstructs per-instance 3D shapes from a sparse, in-the-wild image collection. Specifically, ARTIC3D builds on a skeleton-based surface representation and is further guided by 2D diffusion priors from Stable Diffusion. First, we use 2D diffusion to enhance input images that contain occlusions or truncation, which yields cleaner mask estimates and semantic features. Second, we perform diffusion-guided 3D optimization to estimate shape and texture that are high-fidelity and faithful to the input images. We also propose a novel technique for calculating image-level gradients via diffusion models that are more stable than those of existing alternatives. Finally, we produce realistic animations by fine-tuning the rendered shape and texture under rigid part transformations. Extensive evaluations on multiple existing datasets, as well as on newly introduced noisy web image collections with occlusions and truncation, demonstrate that ARTIC3D outputs are more robust to noisy images, higher in quality in terms of shape and texture details, and more realistic when animated. Project page: https://chhankyao.github.io/artic3d/
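To make the phrase "rigid part transformations" concrete, the following is a minimal sketch of rotating the vertices of one body part about a joint while leaving the rest of the shape untouched. It is an illustration under stated assumptions, not the ARTIC3D implementation: the function names (`rotation_matrix`, `transform_part`), the hard per-vertex part labels, and the toy three-vertex "limb" are all hypothetical, and the actual framework may assign vertices to parts and parameterize bone rotations differently.

```python
import numpy as np


def rotation_matrix(axis, angle):
    """Rodrigues' formula: rotation by `angle` radians about `axis`."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)  # normalize so the formula is valid
    x, y, z = axis
    # Skew-symmetric cross-product matrix of the axis.
    K = np.array([[0.0, -z, y],
                  [z, 0.0, -x],
                  [-y, x, 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)


def transform_part(vertices, part_ids, part, joint, axis, angle):
    """Rigidly rotate the vertices labelled `part` about `joint`.

    Hypothetical helper: assumes each vertex belongs to exactly one part.
    Vertices of other parts are returned unchanged.
    """
    R = rotation_matrix(axis, angle)
    joint = np.asarray(joint, dtype=float)
    out = np.array(vertices, dtype=float)  # copy; the input is not modified
    mask = np.asarray(part_ids) == part
    # Move the joint to the origin, rotate, then move back.
    out[mask] = (out[mask] - joint) @ R.T + joint
    return out


# Toy example: a straight three-vertex "limb" along the x-axis.
vertices = np.array([[0.0, 0.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [2.0, 0.0, 0.0]])
part_ids = np.array([0, 1, 1])       # vertex 0 is the body, 1 and 2 the limb
joint = np.array([1.0, 0.0, 0.0])    # the limb rotates about this point

posed = transform_part(vertices, part_ids, part=1, joint=joint,
                       axis=[0.0, 0.0, 1.0], angle=np.pi / 2)
print(np.round(posed, 6))
```

Because the transformation is rigid, distances between vertices of the same part are preserved; only the pose of the part relative to the rest of the shape changes, which is what makes such a representation convenient for animation.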