Capturing 4D spatiotemporal surroundings is crucial for the safe and reliable operation of robots in dynamic environments. However, most existing methods address only one side of the problem: they either provide coarse geometric tracking via bounding boxes, or detailed 3D structures such as voxel-based occupancy that lack explicit temporal association. In this work, we present Latent Gaussian Splatting for 4D Panoptic Occupancy Tracking (LaGS), which advances spatiotemporal scene understanding in a holistic direction. Our approach combines camera-based end-to-end tracking with mask-based multi-view panoptic occupancy prediction, and addresses the key challenge of efficiently aggregating multi-view information into 3D voxel grids via a novel latent Gaussian splatting approach. Specifically, we first fuse observations into 3D Gaussians that serve as a sparse, point-centric latent representation of the 3D scene, and then splat the aggregated features onto a 3D voxel grid that is decoded by a mask-based segmentation head. We evaluate LaGS on the Occ3D nuScenes and Waymo datasets, achieving state-of-the-art performance for 4D panoptic occupancy tracking. We make our code available at https://lags.cs.uni-freiburg.de/.
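The feature-aggregation step described in the abstract can be illustrated with a minimal sketch. This is an assumed, simplified version (not the authors' implementation): each Gaussian carries a mean, an isotropic scale, and a C-dimensional latent feature vector, and its feature is accumulated at nearby voxel centers with Gaussian weights and then normalized, yielding a dense feature grid that a mask-based segmentation head could decode. The function name `splat_gaussians` and all parameters are hypothetical.

```python
# Hedged sketch of latent Gaussian splatting: scatter per-Gaussian features
# onto a dense 3D voxel grid (isotropic Gaussians assumed for simplicity).
import numpy as np

def splat_gaussians(means, scales, feats, grid_size, voxel_size, origin):
    """means: (N, 3) centers; scales: (N,) isotropic std-devs;
    feats: (N, C) latent features; returns an (X, Y, Z, C) feature grid."""
    X, Y, Z = grid_size
    C = feats.shape[1]
    grid = np.zeros((X, Y, Z, C))
    wsum = np.zeros((X, Y, Z, 1))
    for mu, s, f in zip(means, scales, feats):
        # Voxel index of the Gaussian center and its 3-sigma support radius.
        c = ((mu - origin) / voxel_size).astype(int)
        r = max(1, int(np.ceil(3 * s / voxel_size)))
        lo = np.maximum(c - r, 0)
        hi = np.minimum(c + r + 1, [X, Y, Z])
        xs, ys, zs = (np.arange(lo[d], hi[d]) for d in range(3))
        ii, jj, kk = np.meshgrid(xs, ys, zs, indexing="ij")
        centers = origin + (np.stack([ii, jj, kk], -1) + 0.5) * voxel_size
        d2 = ((centers - mu) ** 2).sum(-1)
        w = np.exp(-0.5 * d2 / s**2)[..., None]  # Gaussian weight per voxel
        grid[ii, jj, kk] += w * f                # accumulate weighted features
        wsum[ii, jj, kk] += w
    return grid / np.maximum(wsum, 1e-8)         # weighted average per voxel
```

In an end-to-end model this scatter would be implemented with differentiable GPU kernels and anisotropic covariances, but the core idea is the same: the sparse point-centric representation is "splatted" into a dense grid before decoding.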