Given sparse views of an object, estimating their camera poses is a long-standing and intractable problem. We harness a pre-trained diffusion model that generates novel views conditioned on viewpoint (Zero-1-to-3). We present ID-Pose, which inverts the denoising diffusion process to estimate the relative pose between two input images. ID-Pose adds noise to one image and predicts that noise conditioned on the other image and a decision variable for the pose. The prediction error serves as the objective, and gradient descent finds the pose that minimizes it. ID-Pose can handle more than two images, estimating each pose from multiple image pairs linked by triangular relationships. ID-Pose requires no training and generalizes to real-world images. In experiments on high-quality real-scanned 3D objects, ID-Pose significantly outperforms state-of-the-art methods.
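The optimization loop described in the abstract can be illustrated with a toy sketch. Everything named here is a hypothetical stand-in, not the ID-Pose implementation: `warp` plays the role of novel-view synthesis on a four-number "image", `toy_noise_predictor` replaces the pre-trained Zero-1-to-3 network with an idealized predictor, the pose is reduced to (elevation, azimuth) offsets, the noise level is fixed, and gradients are taken by finite differences rather than backpropagation. Only the overall structure follows the text: add noise to one image, predict the noise conditioned on the other image and a candidate pose, and descend on the prediction error.

```python
import math
import random


def warp(source, rel_pose):
    """Toy novel-view synthesis (assumption): rotate the source features by the relative pose."""
    d_el, d_az = rel_pose
    s_az, c_az, s_el, c_el = source
    return [
        s_az * math.cos(d_az) + c_az * math.sin(d_az),
        c_az * math.cos(d_az) - s_az * math.sin(d_az),
        s_el * math.cos(d_el) + c_el * math.sin(d_el),
        c_el * math.cos(d_el) - s_el * math.sin(d_el),
    ]


def toy_noise_predictor(noisy, source, rel_pose, alpha, sigma):
    """Hypothetical stand-in for the diffusion model: predicts the noise in `noisy`,
    conditioned on the source image and a candidate relative pose."""
    guess = warp(source, rel_pose)
    return [(n - alpha * g) / sigma for n, g in zip(noisy, guess)]


def noise_error(rel_pose, source, target, noises, alpha, sigma):
    """Mean squared noise-prediction error over a fixed set of noise samples."""
    total = 0.0
    for eps in noises:
        # Forward diffusion step: add noise to the target image.
        noisy = [alpha * x + sigma * e for x, e in zip(target, eps)]
        pred = toy_noise_predictor(noisy, source, rel_pose, alpha, sigma)
        total += sum((p - e) ** 2 for p, e in zip(pred, eps))
    return total / len(noises)


def estimate_relative_pose(source, target, steps=200, lr=0.1, h=1e-4, seed=0):
    """Gradient descent on the pose variable, using central finite differences."""
    rng = random.Random(seed)
    alpha = sigma = math.sqrt(0.5)  # fixed noise level, an assumption of this sketch
    noises = [[rng.gauss(0.0, 1.0) for _ in target] for _ in range(4)]
    pose = [0.0, 0.0]  # decision variable: (elevation offset, azimuth offset)
    for _ in range(steps):
        grad = []
        for i in range(len(pose)):
            plus, minus = list(pose), list(pose)
            plus[i] += h
            minus[i] -= h
            diff = (noise_error(plus, source, target, noises, alpha, sigma)
                    - noise_error(minus, source, target, noises, alpha, sigma))
            grad.append(diff / (2 * h))
        pose = [p - lr * g for p, g in zip(pose, grad)]
    return pose


# Toy usage: the target is the source view moved by a known relative pose.
source = [math.sin(0.2), math.cos(0.2), math.sin(0.1), math.cos(0.1)]
true_rel_pose = (0.3, 0.8)
target = warp(source, true_rel_pose)
estimate = estimate_relative_pose(source, target)
print("estimated (elevation, azimuth) offset:", estimate)
```

In this toy setting the error surface has a single minimum within half a turn of the true pose, so a zero initialization suffices. With a real diffusion model the objective is not guaranteed to be this well behaved, and the choice of initialization and noise levels would matter.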