3D Gaussian Splatting has emerged as a powerful paradigm for explicit 3D scene representation, yet achieving efficient and consistent 3D segmentation remains challenging. Existing segmentation approaches typically rely on high-dimensional feature lifting, which incurs costly optimization, implicit semantics, and task-specific constraints. We present \textbf{Segment Any Gaussians Online (SAGOnline)}, a unified, zero-shot framework that achieves real-time, cross-view consistent segmentation without scene-specific training. SAGOnline decouples the monolithic segmentation problem into lightweight sub-tasks. By integrating video foundation models (e.g., SAM 2), we first generate temporally consistent 2D masks across rendered views. Crucially, instead of learning continuous feature fields, we introduce a \textbf{Rasterization-aware Geometric Consensus} mechanism that leverages the traceability of the Gaussian rasterization pipeline. This allows us to deterministically map 2D predictions to explicit, discrete 3D primitive labels in real time. This discrete representation eliminates the memory and computational burden of feature distillation, enabling instant inference. Extensive evaluations on the NVOS and SPIn-NeRF benchmarks demonstrate that SAGOnline achieves state-of-the-art accuracy (92.7\% and 95.2\% mIoU) while running fastest among compared methods, at 27 ms per frame. By providing a flexible interface for diverse foundation models, our framework supports instant prompt-based, instance, and semantic segmentation, paving the way for interactive 3D understanding in AR/VR and robotics.
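To make the idea of rasterization-aware geometric consensus concrete, the following is a minimal sketch (not the authors' implementation) of how 2D mask labels could be aggregated onto discrete Gaussian primitives, assuming the rasterizer exposes per-pixel lists of contributing Gaussian indices and their alpha-blending weights; the names \texttt{contrib\_ids} and \texttt{contrib\_weights} are hypothetical placeholders for such outputs.

\begin{verbatim}
# Illustrative sketch only: weighted voting of 2D mask labels onto 3D Gaussians.
# Assumes the rasterizer can report which Gaussians contributed to each pixel
# (contrib_ids, -1 for empty slots) and with what blending weight (contrib_weights).
import numpy as np

def label_gaussians(masks, contrib_ids, contrib_weights,
                    num_gaussians, num_labels):
    """masks:           list of (H, W) integer label maps from the 2D segmenter
       contrib_ids:     list of (H, W, K) Gaussian indices hit per pixel
       contrib_weights: list of (H, W, K) blending weights of those Gaussians"""
    votes = np.zeros((num_gaussians, num_labels), dtype=np.float64)
    for mask, ids, w in zip(masks, contrib_ids, contrib_weights):
        valid = ids >= 0                              # drop padded slots
        flat_ids = ids[valid]                         # contributing Gaussians
        flat_lbl = np.broadcast_to(mask[..., None], ids.shape)[valid]
        flat_w = w[valid]
        # Each pixel pushes its 2D label onto the Gaussians that rendered it,
        # weighted by how much each Gaussian contributed to that pixel.
        np.add.at(votes, (flat_ids, flat_lbl), flat_w)
    return votes.argmax(axis=1)                       # consensus label per Gaussian
\end{verbatim}

Because the mapping is a deterministic vote over rasterizer bookkeeping rather than a learned feature field, it requires no per-scene optimization, which is consistent with the instant-inference claim above.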