Long-sequence streaming 3D reconstruction remains a significant open challenge. Existing autoregressive models often fail on long sequences because they anchor poses to the first frame, which leads to attention decay, scale drift, and extrapolation errors. We introduce LongStream, a novel gauge-decoupled streaming visual geometry model for metric-scale scene reconstruction across thousands of frames under a strictly online, future-invisible setting. Our approach is threefold. First, we discard the first-frame anchor and predict keyframe-relative poses, which reformulates long-range extrapolation as a constant-difficulty local task. Second, we introduce orthogonal scale learning, which fully disentangles geometry from scale estimation to suppress drift. Finally, we identify attention bias issues in Transformers, including attention-sink reliance and long-term KV-cache saturation, and propose cache-consistent training combined with periodic cache refresh. This combination suppresses attention biases and contamination over ultra-long sequences and reduces the gap between training and inference. Experiments show that LongStream achieves state-of-the-art performance, enabling stable, metric-scale reconstruction over kilometer-scale sequences at 18 FPS. Project Page: https://3dagentworld.github.io/longstream/
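To make the keyframe-relative formulation concrete, the following is a minimal sketch of how per-frame poses expressed relative to the most recent keyframe could be chained back into a single trajectory. It is not the LongStream implementation: the function names (`make_pose`, `to_keyframe_relative`, `compose_keyframe_relative`), the fixed keyframe interval, and the camera-to-world 4x4 convention are all assumptions chosen for illustration. The point it illustrates is that each relative pose stays bounded by the keyframe spacing, however far the camera has moved from the first frame.

```python
import numpy as np


def make_pose(yaw, translation):
    """Build a 4x4 rigid transform from a yaw angle (about z) and a translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    T[:3, 3] = translation
    return T


def to_keyframe_relative(world_poses, keyframe_every):
    """Express each camera-to-world pose relative to the latest keyframe.

    Assumption: a frame becomes the new keyframe every `keyframe_every` frames,
    and frame 0 is the first keyframe.
    """
    rel_poses, flags = [], []
    T_world_kf = world_poses[0]
    for i, T_world in enumerate(world_poses):
        rel_poses.append(np.linalg.inv(T_world_kf) @ T_world)
        is_keyframe = (i % keyframe_every == 0)
        flags.append(is_keyframe)
        if is_keyframe:
            T_world_kf = T_world  # later frames are expressed relative to this one
    return rel_poses, flags


def compose_keyframe_relative(rel_poses, flags):
    """Chain keyframe-relative poses into a trajectory whose origin is frame 0."""
    T_world_kf = np.eye(4)
    world_poses = []
    for T_rel, is_keyframe in zip(rel_poses, flags):
        T_world = T_world_kf @ T_rel
        world_poses.append(T_world)
        if is_keyframe:
            T_world_kf = T_world
    return world_poses


# Synthetic trajectory: one unit forward and a slight turn per frame.
step = make_pose(0.05, [1.0, 0.0, 0.0])
trajectory = [np.eye(4)]
for _ in range(19):
    trajectory.append(trajectory[-1] @ step)

rel_poses, flags = to_keyframe_relative(trajectory, keyframe_every=5)
recovered = compose_keyframe_relative(rel_poses, flags)
max_rel_distance = max(np.linalg.norm(T[:3, 3]) for T in rel_poses)
final_distance = np.linalg.norm(trajectory[-1][:3, 3])
```

In this toy trajectory, `max_rel_distance` is capped by the keyframe interval while `final_distance` keeps growing with sequence length, which is the sense in which a first-frame anchor turns pose prediction into extrapolation and a keyframe anchor keeps it local.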