Modern Recurrent Neural Networks have become a competitive architecture for 3D reconstruction due to their linear-time complexity. However, their performance degrades significantly when applied beyond the training context length, revealing limited length generalization. In this work, we revisit 3D reconstruction foundation models from a Test-Time Training perspective, framing their designs as an online learning problem. Building on this perspective, we leverage the alignment confidence between the memory state and incoming observations to derive a closed-form learning rate for memory updates, balancing the retention of historical information with adaptation to new observations. This training-free intervention, termed TTT3R, substantially improves length generalization, achieving a $2\times$ improvement in global pose estimation over baselines, while operating at 20 FPS with just 6 GB of GPU memory to process thousands of images. Code available at https://rover-xingyu.github.io/TTT3R
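To make the idea concrete, the confidence-gated memory update described above can be sketched as follows. This is a minimal illustrative example, not the paper's actual closed-form rule: it assumes a linear-attention-style associative memory `S` mapping keys to values, and uses cosine similarity between the memory's prediction and the incoming value as a stand-in for the alignment confidence; the function name, the `0.5 * (1 - conf)` mapping, and the rank-one update are all hypothetical choices for illustration.

```python
import numpy as np

def ttt_style_update(S, k, v, eps=1e-6):
    """One confidence-gated memory update (illustrative sketch).

    S : (d, d) memory/state matrix (an associative map from keys to values)
    k : (d,)   key vector for the incoming observation
    v : (d,)   value vector for the incoming observation
    """
    # Alignment confidence: how well the current memory already explains
    # the new observation, measured here (as an assumption) by the cosine
    # similarity between the memory's prediction S @ k and the value v.
    pred = S @ k
    conf = float(pred @ v / (np.linalg.norm(pred) * np.linalg.norm(v) + eps))

    # Map confidence in [-1, 1] to a per-step learning rate in [0, 1]:
    # low confidence -> large update (adapt), high confidence -> small
    # update (retain). The 0.5 * (1 - conf) mapping is a hypothetical choice.
    beta = 0.5 * (1.0 - conf)

    # Convex interpolation between the old memory and the new rank-one
    # association, trading off retention against adaptation.
    S_new = (1.0 - beta) * S + beta * np.outer(v, k)
    return S_new, beta
```

The key property this sketch captures is that the step size is derived per observation from the state itself rather than fixed, so well-predicted inputs barely perturb the memory while surprising inputs rewrite it more aggressively.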