3D semantic occupancy prediction is fundamental to spatial understanding, as it provides comprehensive semantic cognition of the surrounding environment. However, prevalent approaches rely primarily on extensive labeled data and computationally intensive voxel-based modeling, which restricts the scalability and generalizability of 3D representation learning. In this paper, we introduce GaussTR, a novel Gaussian Transformer that leverages alignment with foundation models to advance self-supervised 3D spatial understanding. GaussTR adopts a Transformer architecture to predict sparse sets of 3D Gaussians that represent scenes in a feed-forward manner. By aligning rendered Gaussian features with the diverse knowledge encoded in pre-trained foundation models, GaussTR facilitates the learning of versatile 3D representations and enables open-vocabulary occupancy prediction without explicit annotations. Empirical evaluations on the Occ3D-nuScenes dataset demonstrate GaussTR's state-of-the-art zero-shot performance: it achieves 11.70 mIoU while reducing training duration by approximately 50%. These results highlight the significant potential of GaussTR for scalable and holistic 3D spatial understanding, with promising implications for autonomous driving and embodied agents. Code is available at https://github.com/hustvl/GaussTR.
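The abstract's core training signal is aligning rendered Gaussian features with frozen foundation-model features. Below is a minimal, hypothetical sketch of such an alignment objective using a cosine-similarity loss; the function name, shapes, and the choice of cosine distance are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

def alignment_loss(rendered, teacher, eps=1e-8):
    """Mean (1 - cosine similarity) between rendered Gaussian features and
    frozen foundation-model ("teacher") features.

    rendered, teacher: arrays of shape (N, D) -- N pixels/rays, D channels.
    This is an illustrative stand-in for GaussTR's alignment objective,
    not the paper's exact loss.
    """
    # L2-normalize both feature sets along the channel dimension.
    r = rendered / (np.linalg.norm(rendered, axis=-1, keepdims=True) + eps)
    t = teacher / (np.linalg.norm(teacher, axis=-1, keepdims=True) + eps)
    # Cosine distance, averaged over all pixels/rays.
    return float(np.mean(1.0 - np.sum(r * t, axis=-1)))

# Identical features yield a loss near 0; orthogonal features yield 1.
print(alignment_loss(np.eye(4), np.eye(4)))
```

Because the loss depends only on feature direction, the rendered features need only match the teacher's semantics up to scale, which is a common choice when distilling from vision-language models.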