MOGS: Monocular Object-guided Gaussian Splatting in Large Scenes

Recent advances in 3D Gaussian Splatting (3DGS) deliver striking photorealism, and extending it to large scenes opens new opportunities for semantic reasoning and prediction in applications such as autonomous driving. State-of-the-art systems for large scenes are mostly built on LiDAR-based pipelines that rely on long-range depth sensing. However, these pipelines require costly high-channel sensors whose dense point clouds strain memory and computation, limiting scalability, fleet deployment, and optimization speed. We present MOGS, a monocular 3DGS framework that replaces active LiDAR depth with object-anchored, metrized dense depth derived from sparse visual-inertial (VI) structure-from-motion (SfM) cues. Our key idea is to exploit image semantics to hypothesize per-object shape priors, anchor them with sparse but metrically reliable SfM points, and propagate the resulting metric constraints across each object to produce dense depth. To address two key challenges, namely insufficient SfM coverage within objects and cross-object geometric inconsistency, MOGS introduces (1) a multi-scale shape consensus module that adaptively merges small segments into the coarse objects best supported by SfM and fits them with parametric shape models, and (2) a cross-object depth refinement module that optimizes per-pixel depth under a joint objective combining geometric consistency, prior anchoring, and edge-aware smoothness. Experiments on public datasets show that, with a low-cost VI sensor suite, MOGS reduces training time by up to 30.4% and memory consumption by 19.8%, while achieving high-quality rendering competitive with costly LiDAR-based approaches in large scenes.
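To make the anchoring idea concrete, the following is a minimal sketch, not the MOGS implementation. It assumes the simplest possible parametric shape, a single plane per object segment, and a pinhole camera with intrinsics `K`. The function names `fit_plane_inverse_depth` and `densify_segment` are hypothetical. For a plane n·P = d and a pixel ray r = K⁻¹[u, v, 1]ᵀ, inverse depth is linear in the ray, 1/z = (n/d)·r, so a handful of sparse metric depths inside a segment determines dense metric depth for every pixel of that segment.

```python
import numpy as np

def pixel_rays(u, v, K):
    """Back-project pixel coordinates to rays r = K^-1 [u, v, 1]^T (one per row)."""
    uv1 = np.column_stack([u, v, np.ones(len(u))]).astype(float)
    return uv1 @ np.linalg.inv(K).T

def fit_plane_inverse_depth(uv, z, K):
    """Least-squares fit of a with 1/z = a . r to sparse metric depths.

    Assumption: the object is planar. Needs at least 3 non-collinear anchors.
    Note this minimizes inverse-depth error, not point-to-plane distance.
    """
    rays = pixel_rays(uv[:, 0], uv[:, 1], K)
    a, *_ = np.linalg.lstsq(rays, 1.0 / z, rcond=None)
    return a

def densify_segment(mask, a, K):
    """Propagate the fitted plane to every pixel of a boolean segment mask."""
    v, u = np.nonzero(mask)
    inv_z = pixel_rays(u, v, K) @ a
    depth = np.full(mask.shape, np.nan)   # NaN outside the segment
    valid = inv_z > 1e-6                  # skip pixels at or behind the horizon
    depth[v[valid], u[valid]] = 1.0 / inv_z[valid]
    return depth
```

A call such as `densify_segment(mask, fit_plane_inverse_depth(uv, z, K), K)` takes `uv` as an (N, 2) array of anchor pixel coordinates, `z` as their metric depths, and `mask` as an (H, W) boolean array. A real system would need richer shape models, robust fitting against outlier anchors, and a refinement step across object boundaries, as the abstract describes.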
