Despite recent advances in the Large Reconstruction Model (LRM) demonstrating impressive results, extending its input from a single image to multiple images leads to inefficiency, subpar geometric and texture quality, and slower convergence than expected. We attribute this to the fact that LRM formulates 3D reconstruction as a naive images-to-3D translation problem, ignoring the strong 3D coherence among the input images. In this paper, we propose a Multi-view Large Reconstruction Model (M-LRM) designed to reconstruct high-quality 3D shapes from multi-view images in a 3D-aware manner. Specifically, we introduce a multi-view consistent cross-attention scheme that enables M-LRM to accurately query information from the input images. Moreover, we employ the 3D priors of the input multi-view images to initialize the triplane tokens. Compared to previous methods, M-LRM generates 3D shapes of high fidelity. Experimental studies demonstrate that our model achieves a significant performance gain and faster training convergence. Project page: \url{https://murphylmf.github.io/M-LRM/}.
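To make the core idea concrete, the following is a minimal NumPy sketch of cross-attention in which triplane tokens act as queries over image tokens pooled from all input views, so each 3D token can gather evidence across views. This is an illustrative, generic formulation under assumed shapes; the paper's actual multi-view consistent scheme (e.g., any geometry-guided restriction of which view tokens each triplane token attends to) is not reproduced here, and all names and dimensions are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multiview_cross_attention(triplane_tokens, view_tokens):
    """Triplane tokens (queries) attend to tokens from all input views
    (keys/values), aggregating multi-view evidence into each 3D token.

    triplane_tokens: (T, d)    -- T triplane token embeddings of dim d
    view_tokens:     (V, N, d) -- V views, N image tokens per view
    Returns:         (T, d)    -- updated triplane tokens
    """
    V, N, d = view_tokens.shape
    kv = view_tokens.reshape(V * N, d)            # pool all views into one key/value set
    scores = triplane_tokens @ kv.T / np.sqrt(d)  # scaled dot-product scores, (T, V*N)
    attn = softmax(scores, axis=-1)               # attention weights over all view tokens
    return attn @ kv                              # weighted sum of view tokens per query

rng = np.random.default_rng(0)
tri = rng.standard_normal((6, 8))       # 6 triplane tokens, dim 8 (toy sizes)
views = rng.standard_normal((4, 5, 8))  # 4 input views, 5 tokens each
out = multiview_cross_attention(tri, views)
print(out.shape)  # (6, 8)
```

In a full model the queries, keys, and values would pass through learned linear projections and multiple heads; this sketch omits them to keep the attention pattern itself visible.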