We present a scalable 3D reconstruction model that addresses a critical limitation of offline feed-forward methods: their computational and memory requirements grow quadratically with the number of input images. Our approach builds on the key insight that this bottleneck stems from the variable-length Key-Value (KV) representation of scene geometry, which we distill into a fixed-size Multi-Layer Perceptron (MLP) via test-time training. VGG-T$^3$ (Visual Geometry Grounded Test-Time Training) scales linearly with the number of input views, similar to online models, and reconstructs a $1k$-image collection in just $54$ seconds, an $11.6\times$ speed-up over baselines that rely on softmax attention. Because our method retains global scene aggregation capability, it outperforms other linear-time methods by large margins in point map reconstruction error. Finally, we demonstrate the visual localization capabilities of our model by querying the scene representation with unseen images.
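To make the core idea concrete, the following is a minimal sketch (not the authors' code) of distilling a growing KV cache into a fixed-size MLP via test-time training. It assumes a small two-layer MLP trained per view with a simple reconstruction objective $\|f(k) - v\|^2$; the actual VGG-T$^3$ objective, architecture, and optimizer are not specified by this abstract.

```python
# Sketch only: fold each view's (K, V) tokens into a fixed-size MLP so that
# memory stays constant and total cost grows linearly with the number of views,
# unlike a softmax-attention KV cache that grows with every image.
import torch
import torch.nn as nn


class KVDistillMLP(nn.Module):
    """Fixed-size scene memory: maps a key/query vector to a value vector."""

    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, q: torch.Tensor) -> torch.Tensor:
        return self.net(q)


def absorb_view(mlp: KVDistillMLP, k: torch.Tensor, v: torch.Tensor,
                steps: int = 4, lr: float = 1e-2) -> None:
    """Test-time training step: distill one view's (K, V) tokens into the MLP.

    Hypothetical reconstruction loss ||mlp(k) - v||^2; cost per view is fixed,
    so processing N views scales as O(N).
    """
    opt = torch.optim.SGD(mlp.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((mlp(k) - v) ** 2).mean()
        loss.backward()
        opt.step()


if __name__ == "__main__":
    dim, tokens_per_view = 64, 128
    memory = KVDistillMLP(dim)
    for _ in range(8):                       # stream 8 views; memory size stays fixed
        k = torch.randn(tokens_per_view, dim)
        v = torch.randn(tokens_per_view, dim)
        absorb_view(memory, k, v)
    queries = torch.randn(16, dim)           # e.g. tokens of an unseen localization query
    values = memory(queries)                 # read-out replaces attention over a KV cache
    print(values.shape)                      # torch.Size([16, 64])
```

In this toy setup, querying the MLP with tokens from an unseen image mirrors the visual localization use case mentioned above: the scene is read out from fixed-size weights rather than from a per-image KV cache.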