In this work, we introduce the Geometry-Aware Large Reconstruction Model (GeoLRM), an approach that predicts high-quality assets with 512k Gaussians from 21 input images using only 11 GB of GPU memory. Previous works neglect the inherent sparsity of 3D structure and do not exploit the explicit geometric relationships between 3D structure and 2D images. This restricts them to low-resolution representations and makes it difficult to scale to dense input views for better quality. GeoLRM tackles these issues with a novel 3D-aware transformer structure that directly processes 3D points and uses deformable cross-attention to integrate image features into the 3D representation. We implement this solution as a two-stage pipeline: first, a lightweight proposal network generates a sparse set of 3D anchor points from the posed input images; then, a specialized reconstruction transformer refines the geometry and retrieves textural details. Extensive experimental results demonstrate that GeoLRM significantly outperforms existing models, especially for dense view inputs. We also demonstrate the practical applicability of our model on 3D generation tasks, showcasing its versatility and potential for broader adoption in real-world applications.
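To make the "explicit geometric relationship between 3D structure and 2D images" concrete, the following is a minimal sketch, not the GeoLRM implementation. It assumes a pinhole camera with known pose and projects one 3D anchor point into each posed view, then bilinearly samples a single-channel feature map at the projected pixel and averages the samples. The learned sampling offsets and attention weights of deformable cross-attention are deliberately omitted. All names (`project_point`, `bilinear_sample`, `aggregate_anchor_feature`, and the view dictionary keys) are hypothetical and chosen for illustration only.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def project_point(
    point: Sequence[float],
    rotation: Sequence[Sequence[float]],
    translation: Sequence[float],
    fx: float, fy: float, cx: float, cy: float,
) -> Optional[Tuple[float, float]]:
    """Project a world-space point to pixel coordinates (pinhole model).

    Returns None when the point is not in front of the camera.
    """
    # World -> camera frame: x_cam = R @ x_world + t
    cam = [
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    ]
    if cam[2] <= 1e-8:
        return None
    # Perspective division followed by the intrinsics
    return (fx * cam[0] / cam[2] + cx, fy * cam[1] / cam[2] + cy)


def bilinear_sample(
    feature_map: List[List[float]], u: float, v: float
) -> Optional[float]:
    """Bilinearly interpolate an H x W map at pixel (u, v); None if outside."""
    h, w = len(feature_map), len(feature_map[0])
    if not (0.0 <= u <= w - 1 and 0.0 <= v <= h - 1):
        return None
    x0, y0 = int(u), int(v)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = u - x0, v - y0
    return (
        (1 - dx) * (1 - dy) * feature_map[y0][x0]
        + dx * (1 - dy) * feature_map[y0][x1]
        + (1 - dx) * dy * feature_map[y1][x0]
        + dx * dy * feature_map[y1][x1]
    )


def aggregate_anchor_feature(
    point: Sequence[float], views: List[Dict]
) -> Optional[float]:
    """Average the features sampled at the anchor's projection in each view.

    Views where the point is behind the camera or outside the image are
    skipped. Returns None if no view sees the point.
    """
    samples = []
    for view in views:
        uv = project_point(
            point, view["R"], view["t"],
            view["fx"], view["fy"], view["cx"], view["cy"],
        )
        if uv is None:
            continue
        value = bilinear_sample(view["features"], uv[0], uv[1])
        if value is not None:
            samples.append(value)
    return sum(samples) / len(samples) if samples else None


# Toy usage: two identical cameras looking down +z, 2 x 2 feature maps.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
base_view = {"R": IDENTITY, "t": [0.0, 0.0, 2.0],
             "fx": 1.0, "fy": 1.0, "cx": 0.5, "cy": 0.5}
views = [
    {**base_view, "features": [[0.0, 1.0], [2.0, 3.0]]},
    {**base_view, "features": [[4.0, 4.0], [4.0, 4.0]]},
]
anchor_feature = aggregate_anchor_feature([0.0, 0.0, 0.0], views)
```

Because each anchor only touches the pixels it projects to, the cost of this lookup grows with the number of anchors and views rather than with a dense 3D grid, which is the intuition behind processing a sparse set of 3D points directly.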