We introduce a hierarchical probabilistic approach to go from a 2D image to multiview 3D: a diffusion "prior" models the unseen 3D geometry, which then conditions a diffusion "decoder" to generate novel views of the subject. We use a pointmap-based geometric representation in a multiview image format to coordinate the generation of multiple target views simultaneously. We facilitate correspondence between views by assuming fixed target camera poses relative to the source camera, and constructing a predictable distribution of geometric features per target. Our modular, geometry-driven approach to novel-view synthesis (called "unPIC") outperforms SoTA baselines such as CAT3D and One-2-3-45 on held-out objects from ObjaverseXL, as well as on real-world objects from Google Scanned Objects, Amazon Berkeley Objects, and the Digital Twin Catalog.
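The two-stage pipeline described above can be sketched as follows. This is a minimal, hypothetical illustration of the control flow only: the function names, shapes, and mocked sampling (plain random draws standing in for diffusion sampling) are our assumptions, not the paper's implementation.

```python
import numpy as np

def sample_pointmaps(source_image, num_views, rng):
    """Stage 1 ('prior', hypothetical mock): sample a multiview grid of
    XYZ pointmaps encoding the unseen 3D geometry, one pointmap per
    fixed target camera pose. A diffusion model would sample here; we
    stand it in with Gaussian noise of the right shape."""
    h, w, _ = source_image.shape
    return rng.normal(size=(num_views, h, w, 3))

def decode_views(source_image, pointmaps, rng):
    """Stage 2 ('decoder', hypothetical mock): generate RGB novel views
    conditioned on the source image and the sampled pointmaps. Again,
    the conditional diffusion step is mocked with noise."""
    num_views, h, w, _ = pointmaps.shape
    return rng.normal(size=(num_views, h, w, 3))

rng = np.random.default_rng(0)
source = np.zeros((64, 64, 3))  # placeholder 2D input image
pointmaps = sample_pointmaps(source, num_views=8, rng=rng)  # geometry first
views = decode_views(source, pointmaps, rng)                # then appearance
```

The key design point the sketch mirrors is modularity: geometry (pointmaps) is sampled first and fixed, so the appearance decoder only has to stay consistent with a single shared 3D hypothesis across all target views.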