We tackle the challenge of efficiently reconstructing a 3D asset from a single image, motivated by the growing demand for automated 3D content creation pipelines. Previous methods primarily rely on Score Distillation Sampling (SDS) and Neural Radiance Fields (NeRF). Despite their significant success, these approaches face practical limitations due to lengthy optimization and considerable memory usage. In this report, we introduce Gamba, an end-to-end amortized model for 3D reconstruction from single-view images, built on two main insights: (1) 3D representation: leveraging a large number of 3D Gaussians for an efficient 3D Gaussian splatting process; (2) Backbone design: introducing a Mamba-based sequential network that supports context-dependent reasoning and scales linearly with the sequence (token) length, accommodating a substantial number of Gaussians. Gamba also incorporates significant advancements in data preprocessing, regularization design, and training methodology. We evaluate Gamba against existing optimization-based and feed-forward 3D generation approaches on the real-world scanned OmniObject3D dataset. Gamba demonstrates competitive generation capabilities, both qualitatively and quantitatively, while running remarkably fast: approximately 0.6 seconds on a single NVIDIA A100 GPU.
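To make the linear-scaling claim concrete, the following is a minimal sketch rather than Gamba's or Mamba's actual implementation. It assumes a scalar recurrence with fixed parameters, whereas Mamba's selective scan uses input-dependent parameters and a hardware-aware parallel implementation. The names `linear_scan`, `attention_pair_count`, `decay`, `in_gain`, and `out_gain` are hypothetical and chosen only for illustration. The sketch compares the number of state updates in a sequential scan with the number of token pairs a full self-attention layer would score.

```python
def linear_scan(tokens, decay, in_gain, out_gain):
    """Scalar linear recurrence h_t = decay * h_{t-1} + in_gain * x_t.

    Illustrative stand-in for a state-space layer (an assumption, not
    Mamba's selective scan). Returns (outputs, steps), where steps counts
    the recurrence updates performed.
    """
    state = 0.0
    outputs = []
    steps = 0
    for x in tokens:
        # One constant-cost state update per token: total cost is O(L).
        state = decay * state + in_gain * x
        outputs.append(out_gain * state)
        steps += 1
    return outputs, steps


def attention_pair_count(length):
    """Token pairs scored by a full self-attention layer: O(L^2)."""
    return length * length


# Sequence lengths here are arbitrary illustrative values.
for length in (1024, 4096, 16384):
    _, steps = linear_scan([1.0] * length, decay=0.9, in_gain=0.1, out_gain=1.0)
    print(length, steps, attention_pair_count(length))
```

In this sketch, quadrupling the token count quadruples the scan's update count but multiplies the attention pair count by sixteen. This is the property that makes a sequential backbone attractive when each token corresponds to one of many Gaussians.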