We propose tttLRM, a novel large 3D reconstruction model that leverages a Test-Time Training (TTT) layer to enable long-context, autoregressive 3D reconstruction with linear computational complexity, which further scales the model's capability. Our framework efficiently compresses multiple image observations into the fast weights of the TTT layer, forming an implicit 3D representation in the latent space that can be decoded into various explicit formats, such as Gaussian Splats (GS), for downstream applications. The online learning variant of our model supports progressive 3D reconstruction and refinement from streaming observations. We demonstrate that pretraining on novel view synthesis tasks transfers effectively to explicit 3D modeling, resulting in improved reconstruction quality and faster convergence. Extensive experiments show that our method outperforms state-of-the-art approaches in feedforward 3D Gaussian reconstruction on both objects and scenes.
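To make the idea of compressing observations into fast weights concrete, the following is a minimal sketch of a generic linear fast-weight update, not the tttLRM architecture itself. It assumes a single fast-weight matrix `W`, a squared reconstruction loss, plain gradient descent, and unit-norm keys; the function names `ttt_update`, `ttt_compress`, and `ttt_query`, as well as the `lr` and `chunk` parameters, are hypothetical and introduced only for illustration. The sketch shows two properties named above: the cost grows linearly with the number of tokens, and the state `W` has a fixed size, so new observations can be folded in as they stream in.

```python
import numpy as np


def ttt_update(W, K, V, lr=0.5):
    """One test-time training step on a chunk of (key, value) tokens.

    Assumption: the fast weights are a single matrix W of shape (d_v, d_k),
    trained with the loss 0.5 * mean ||K W^T - V||^2 and plain gradient descent.
    """
    residual = K @ W.T - V              # (c, d_v) prediction error on this chunk
    grad = residual.T @ K / len(K)      # (d_v, d_k) gradient of the loss w.r.t. W
    return W - lr * grad


def ttt_compress(keys, values, lr=0.5, chunk=8):
    """Fold a token stream into fixed-size fast weights, chunk by chunk.

    Each token is visited once, so the cost is linear in the number of tokens.
    """
    W = np.zeros((values.shape[1], keys.shape[1]))
    for start in range(0, len(keys), chunk):
        W = ttt_update(W, keys[start:start + chunk], values[start:start + chunk], lr)
    return W


def ttt_query(W, queries):
    """Read from the compressed state: apply the fast weights to query tokens."""
    return queries @ W.T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.normal(size=(3, 8))                       # hypothetical ground-truth map
    keys = rng.normal(size=(64, 8))
    keys /= np.linalg.norm(keys, axis=1, keepdims=True)  # unit-norm keys keep lr=0.5 stable
    values = keys @ A.T

    W = ttt_compress(keys, values)
    print("state shape:", W.shape)
    print("error before:", np.linalg.norm(A), "after:", np.linalg.norm(W - A))
```

Because `ttt_update` takes the current state and returns the next one, the same function serves the streaming case: calling it on each newly arrived chunk yields the same `W` as processing the whole sequence with `ttt_compress`.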