Long-sequence streaming 3D reconstruction remains a significant open challenge: existing autoregressive models typically anchor poses to the first frame and therefore suffer attention decay, scale drift, and extrapolation errors on long sequences. We introduce LongStream, a gauge-decoupled streaming visual geometry model for metric-scale scene reconstruction across thousands of frames. Our approach is threefold. First, we discard the first-frame anchor and predict keyframe-relative poses, reformulating long-range extrapolation as a constant-difficulty local task. Second, we introduce orthogonal scale learning, which fully disentangles geometry estimation from scale estimation to suppress drift. Finally, we address Transformer cache pathologies, namely attention-sink reliance and long-term KV-cache contamination, with cache-consistent training combined with periodic cache refresh; this suppresses attention degradation over ultra-long sequences and narrows the gap between training and inference. Experiments show that LongStream achieves state-of-the-art performance, delivering stable, metric-scale reconstruction over kilometer-scale sequences at 18 FPS. Project Page: https://3dagentworld.github.io/longstream/
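The keyframe-relative formulation above can be illustrated with a minimal sketch (the function names, data layout, and composition rule here are illustrative assumptions, not the paper's actual parameterization): each frame's pose is predicted relative to a nearby keyframe, and global poses are recovered by chaining that local transform onto the keyframe's world pose.

```python
import numpy as np

def se3(R, t):
    """Build a 4x4 homogeneous transform from a 3x3 rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def compose_global_poses(keyframe_world_poses, relative_poses, keyframe_ids):
    """Recover world-frame poses from keyframe-relative predictions.

    keyframe_world_poses: dict mapping keyframe id -> 4x4 world pose T_world_kf
    relative_poses:       list of 4x4 transforms T_kf_frame (frame in its keyframe's frame)
    keyframe_ids:         list giving each frame's anchor keyframe id

    T_world_frame = T_world_kf @ T_kf_frame, so each prediction stays a
    bounded, local task regardless of sequence length.
    """
    return [keyframe_world_poses[k] @ T_rel
            for k, T_rel in zip(keyframe_ids, relative_poses)]
```

Because every predicted transform is local to its keyframe, the model never has to extrapolate a pose relative to a distant first frame; only the chain of keyframe poses accumulates over the sequence.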