Streaming 3D perception is well suited to robotics and augmented reality, where long visual streams must be processed efficiently and consistently. Recent recurrent models offer a promising solution: they maintain fixed-size states and enable linear-time inference. However, they often suffer from drift accumulation and temporal forgetting over long sequences because compressed latent memories have limited capacity. We propose Mem3R, a streaming 3D reconstruction model with a hybrid memory design that decouples camera tracking from geometric mapping to improve temporal consistency over long sequences. For camera tracking, Mem3R employs an implicit fast-weight memory, implemented as a lightweight multi-layer perceptron (MLP) updated via Test-Time Training. For geometric mapping, Mem3R maintains an explicit, token-based, fixed-size state. Compared with CUT3R, this design not only significantly improves long-sequence performance but also reduces the model size from 793M to 644M parameters. Mem3R is also compatible with the plug-and-play state update strategies developed for CUT3R. Specifically, integrating it with TTT3R decreases Absolute Trajectory Error by up to 39% over the base implementation on 500- to 1000-frame sequences. The improvements also extend to other downstream tasks, including video depth estimation and 3D reconstruction, while preserving constant GPU memory usage and comparable inference throughput. Project page: https://lck666666.github.io/Mem3R/
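To make the idea of a fast-weight memory concrete, the following is a minimal sketch of the general technique, not Mem3R's actual architecture. It assumes a two-layer MLP whose weights serve as the memory, a squared-error key-to-value objective, and one plain gradient step per write. The class name `FastWeightMemory`, its parameters, and the loss are all hypothetical choices made for illustration.

```python
import numpy as np


class FastWeightMemory:
    """Toy two-layer MLP whose weights act as the memory state (illustrative only)."""

    def __init__(self, dim, hidden, lr=0.02, seed=0):
        rng = np.random.default_rng(seed)
        # Fixed-size state: the number of weights never grows with stream length.
        self.w1 = rng.normal(0.0, 1.0 / np.sqrt(dim), size=(dim, hidden))
        self.w2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), size=(hidden, dim))
        self.lr = lr

    def read(self, query):
        # Retrieval is just a forward pass through the current weights.
        return np.tanh(query @ self.w1) @ self.w2

    def write(self, key, value):
        # One gradient step on L = 0.5 * ||f(key) - value||^2 (assumed objective).
        h = np.tanh(key @ self.w1)
        err = h @ self.w2 - value
        grad_w2 = np.outer(h, err)
        grad_h = self.w2 @ err
        grad_w1 = np.outer(key, grad_h * (1.0 - h ** 2))
        self.w1 -= self.lr * grad_w1
        self.w2 -= self.lr * grad_w2
        return 0.5 * float(err @ err)  # loss measured before the update


# Repeatedly writing one association drives the read-out toward the stored value.
rng = np.random.default_rng(1)
memory = FastWeightMemory(dim=8, hidden=16)
key, value = rng.normal(size=8), rng.normal(size=8)
losses = [memory.write(key, value) for _ in range(50)]
recalled = memory.read(key)
```

In this sketch the memory cost is constant regardless of how many frames are written, which mirrors the fixed-size-state property the abstract describes; how Mem3R derives keys, values, and its update rule is not specified here.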