We present Fillerbuster, a method that completes unknown regions of a 3D scene using a novel large-scale multi-view latent diffusion transformer. Casual captures are often sparse and miss surrounding content behind objects or above the scene. Existing methods are ill-suited to this challenge: they either focus on making the known pixels look good with sparse-view priors, or on creating the missing sides of objects from just one or two photos. In practice, we often have hundreds of input frames and want to complete areas that are missing and unobserved in those frames. Additionally, the images often lack known camera parameters. Our solution is to train a generative model that can consume a large context of input frames while generating unknown target views and recovering image poses when desired. We show results where we complete partial captures on two existing datasets. We also present an uncalibrated scene completion task in which our unified model predicts poses and creates new content simultaneously. Our model is the first to predict many images and poses together for scene completion.