Recent advances in generalizable 3D Gaussian Splatting (3DGS) have enabled rapid 3D scene reconstruction within seconds, eliminating the need for per-scene optimization. However, existing methods primarily follow an offline reconstruction paradigm and lack the capacity for continuous reconstruction, which limits their applicability to online scenarios such as robotics and VR/AR. In this paper, we introduce OnlineX, a feed-forward framework that reconstructs both 3D visual appearance and language fields in an online manner using only streaming images. A key challenge in the online formulation is cumulative drift, which is rooted in the fundamental conflict between two opposing roles of the memory state: an active role that constantly refreshes to capture high-frequency local geometry, and a stable role that conservatively accumulates and preserves the long-term global structure. To address this, we propose a decoupled active-to-stable state evolution paradigm. Our framework decouples the memory state into a dedicated active state and a persistent stable state, and then fuses information from the former into the latter to achieve both fidelity and stability. Moreover, we jointly model visual appearance and language fields and incorporate an implicit Gaussian fusion module to enhance reconstruction quality. Experiments on mainstream datasets demonstrate that our method consistently outperforms prior work in novel view synthesis and semantic understanding, showing robust performance across input sequences of varying lengths while running at real-time inference speed.