Online 3D reconstruction requires estimating camera pose and scene geometry under strict causal and bounded-memory constraints. Existing methods often suffer from drift, jitter, or collapse on long sequences. We trace these failures to a fundamental mismatch. Streaming geometry is inherently temporally heterogeneous, with evidence ranging from short-lived correspondences to persistent global scale. However, current architectures impose uniform and pathological influence patterns. For example, sliding windows enforce hard cutoffs, while ungated recurrence and causal attention cause cache saturation and spike-like attention sinks. To resolve this, we formalize geometric propagation as an \emph{evidence influence kernel} and propose HorizonStream, a long-horizon Transformer that explicitly factorizes this kernel. For the long-range temporal factor, Geometric Linear Attention learns channel-wise decay rates to enable bounded, multi-timescale propagation of geometric evidence. For the short-range spatial factor, Geometric Local Attention with Spatiotemporal RoPE performs reliable 3D matching while suppressing attention sinks. Finally, Metric Readout Tokens recover stable scale and rigid pose directly from the persistent geometric state. Extensive experiments show that HorizonStream, trained on only 48-frame clips, generalizes stably to sequences exceeding 10,000\ frames with constant memory and linear time, achieving state-of-the-art streaming 3D reconstruction performance. Project Page: https://3dagentworld.github.io/horizonstream/
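The idea of bounded, multi-timescale propagation through channel-wise decay can be illustrated with a small recurrence. The sketch below is not the HorizonStream implementation: the abstract does not give the exact formulation of Geometric Linear Attention, so it assumes a generic decayed linear-attention update, $S_t = \mathrm{diag}(\lambda)\, S_{t-1} + k_t v_t^\top$ with readout $o_t = S_t^\top q_t$. The names `decayed_linear_attention`, `explicit_kernel_attention`, and `decay` are hypothetical, and the decay rates are fixed constants here rather than learned.

```python
import numpy as np


def decayed_linear_attention(q, k, v, decay):
    """Causal linear attention with a per-channel decay (illustrative only).

    q, k:  arrays of shape (T, d_k)
    v:     array of shape (T, d_v)
    decay: array of shape (d_k,), each entry assumed to lie in (0, 1)

    Returns the per-frame outputs (T, d_v) and the final state (d_k, d_v).
    The state has a fixed size, so memory does not grow with T.
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    state = np.zeros((d_k, d_v))
    out = np.empty((T, d_v))
    for t in range(T):
        # Each key channel forgets at its own rate, then absorbs the new frame.
        state = decay[:, None] * state + np.outer(k[t], v[t])
        out[t] = q[t] @ state
    return out, state


def explicit_kernel_attention(q, k, v, decay):
    """Same computation written as an explicit influence kernel, O(T^2).

    Frame s influences frame t through sum_c q[t, c] * decay[c]**(t - s) * k[s, c].
    """
    T = q.shape[0]
    out = np.zeros((T, v.shape[1]))
    for t in range(T):
        for s in range(t + 1):
            weight = np.sum(q[t] * decay ** (t - s) * k[s])
            out[t] += weight * v[s]
    return out
```

Under these assumptions, a channel with decay close to 1 retains evidence over thousands of frames, while a channel with a small decay behaves like a short window. Because each channel's accumulated state is a geometric series, it stays bounded whenever keys and values are bounded, which is one way to read the abstract's claim of constant memory and linear time.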