In this paper, we propose Flash3D, a method for scene reconstruction and novel view synthesis from a single image that is both highly generalisable and efficient. For generalisability, we start from a "foundation" model for monocular depth estimation and extend it to a full 3D shape and appearance reconstructor. For efficiency, we base this extension on feed-forward Gaussian Splatting. Specifically, we predict a first layer of 3D Gaussians at the predicted depth, and then add further layers of Gaussians that are offset in space, allowing the model to complete the reconstruction behind occlusions and truncations. Flash3D is very efficient, trainable on a single GPU in a day, and thus accessible to most researchers. It achieves state-of-the-art results when trained and tested on RealEstate10k. When transferred to unseen datasets such as NYU, it outperforms competitors by a large margin. More impressively, when transferred to KITTI, Flash3D achieves better PSNR than methods trained specifically on that dataset. In some instances, it even outperforms recent methods that take multiple views as input. Code, models, demo, and more results are available at https://www.robots.ox.ac.uk/~vgg/research/flash3d/.
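The layered-Gaussian idea in the abstract can be illustrated with a minimal sketch: back-project each pixel to a 3D point at its predicted depth (the first layer of Gaussian centres), then repeat at depths pushed further along each camera ray to place the additional, spatially offset layers. This is an assumption-laden illustration, not the paper's implementation; the function names and the per-pixel `offsets` inputs (standing in for network-predicted offsets) are hypothetical.

```python
import numpy as np

def unproject(depth, K):
    """Back-project a depth map (H, W) into camera-space 3D points (H, W, 3)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T       # per-pixel camera rays (z = 1)
    return rays * depth[..., None]        # scale each ray by its depth

def gaussian_layers(depth, K, offsets):
    """First layer of Gaussian centres at the predicted depth, plus extra
    layers displaced further along each ray by (hypothetical) predicted
    per-pixel depth offsets, to model occluded or truncated geometry."""
    layers = [unproject(depth, K)]
    for d in offsets:                     # each d: (H, W) positive offset map
        layers.append(unproject(depth + d, K))
    return np.stack(layers)               # (num_layers, H, W, 3)

# Toy example: constant 2x2 depth map, simple intrinsics, one offset layer.
K = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.5],
              [0.0, 0.0, 1.0]])
depth = np.full((2, 2), 2.0)
means = gaussian_layers(depth, K, [np.full((2, 2), 0.5)])
print(means.shape)  # (2, 2, 2, 3): two layers of 2x2 Gaussian centres
```

In the toy run, the first layer sits at depth 2.0 and the second at 2.5 along the same rays, matching the intuition of Gaussians stacked behind the visible surface.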