We propose Long-LRM, a generalizable 3D Gaussian reconstruction model capable of reconstructing a large scene from a long sequence of input images. Specifically, our model processes 32 source images at 960×540 resolution in only 1.3 seconds on a single A100 80G GPU. Our architecture features a mixture of recent Mamba2 blocks and classical transformer blocks, which allows many more tokens to be processed than in prior work, enhanced by efficient token merging and Gaussian pruning steps that balance quality and efficiency. Unlike previous feed-forward models, which are limited to processing 1–4 input images and can reconstruct only a small portion of a large scene, Long-LRM reconstructs the entire scene in a single feed-forward step. On large-scale scene datasets such as DL3DV-140 and Tanks and Temples, our method achieves performance comparable to optimization-based approaches while being two orders of magnitude more efficient. Project page: https://arthurhero.github.io/projects/llrm