We propose a Pose-Free Large Reconstruction Model (PF-LRM) that reconstructs a 3D object from a few unposed images, even when they have little visual overlap, while simultaneously estimating the relative camera poses, all in ~1.3 seconds on a single A100 GPU. PF-LRM is a highly scalable method that uses self-attention blocks to exchange information between 3D object tokens and 2D image tokens; we predict a coarse point cloud for each view, and then use a differentiable Perspective-n-Point (PnP) solver to obtain the camera poses. When trained on a large collection of multi-view posed data covering ~1M objects, PF-LRM shows strong cross-dataset generalization, and outperforms baseline methods by a large margin in both pose prediction accuracy and 3D reconstruction quality on various unseen evaluation datasets. We also demonstrate our model's applicability to downstream text/image-to-3D tasks with fast feed-forward inference. Our project website is at https://totoro97.github.io/pf-lrm.
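To make the pose-estimation step concrete, the following is a minimal sketch of recovering a camera pose from a per-view point cloud and the pixels those points project to. It uses a plain (optionally weighted) Direct Linear Transform in NumPy, which is one standard way to solve PnP and is not necessarily the differentiable solver used in PF-LRM. The function names `solve_pnp_dlt` and `project`, the `weights` argument, and the assumption that the intrinsics `K` are known are all illustrative choices, not details taken from the abstract.

```python
import numpy as np


def project(points_3d, R, t, K):
    """Project 3D points to pixels with a pinhole camera: uv ~ K (R X + t)."""
    cam = points_3d @ R.T + t          # points in the camera frame
    uv = cam @ K.T                     # apply intrinsics
    return uv[:, :2] / uv[:, 2:3]      # perspective divide


def solve_pnp_dlt(points_3d, pixels, K, weights=None):
    """Estimate (R, t) such that pixels ~ K (R X + t), using the DLT.

    points_3d: (N, 3) points in the object frame (hypothetical network output).
    pixels:    (N, 2) pixel coordinates the points project to.
    K:         (3, 3) camera intrinsics, assumed known here.
    weights:   optional (N,) non-negative per-point confidences (assumption).
    """
    points_3d = np.asarray(points_3d, dtype=np.float64)
    pixels = np.asarray(pixels, dtype=np.float64)
    n = points_3d.shape[0]
    if n < 6:
        raise ValueError("the DLT needs at least 6 correspondences")
    if weights is None:
        weights = np.ones(n)

    # Remove the intrinsics: normalized coordinates (x, y, 1) = K^-1 (u, v, 1).
    pix_h = np.hstack([pixels, np.ones((n, 1))])
    xy = (np.linalg.inv(K) @ pix_h.T).T[:, :2]

    # Each correspondence gives two linear equations in the 12 entries of
    # P = [R | t] (up to scale): p1.X - x * p3.X = 0 and p2.X - y * p3.X = 0.
    X_h = np.hstack([points_3d, np.ones((n, 1))])
    zeros = np.zeros_like(X_h)
    rows_x = np.hstack([X_h, zeros, -xy[:, 0:1] * X_h])
    rows_y = np.hstack([zeros, X_h, -xy[:, 1:2] * X_h])
    w = np.sqrt(np.asarray(weights, dtype=np.float64))[:, None]
    A = np.vstack([w * rows_x, w * rows_y])

    # The least-squares solution is the right singular vector associated
    # with the smallest singular value of A.
    _, _, Vt = np.linalg.svd(A)
    P = Vt[-1].reshape(3, 4)

    # Resolve the sign ambiguity so the rotation part has positive determinant.
    if np.linalg.det(P[:, :3]) < 0:
        P = -P

    # Project the 3x3 block onto the nearest rotation and undo the scale.
    U, S, Vt_m = np.linalg.svd(P[:, :3])
    R = U @ Vt_m
    t = P[:, 3] / S.mean()
    return R, t
```

On noise-free synthetic data, `solve_pnp_dlt(X, project(X, R, t, K), K)` should return the `R` and `t` used for projection up to numerical precision. The DLT degenerates when all points are coplanar, and a practical pipeline would typically add outlier handling and nonlinear refinement on top of it.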