Existing learning-based occupancy prediction methods rely on large-scale 3D annotations and generalize poorly across environments. We present FreeOcc, a training-free framework for open-vocabulary occupancy prediction from monocular or RGB-D sequences. Unlike prior approaches that require voxel-level supervision and ground-truth camera poses, FreeOcc operates without 3D annotations, pose ground truth, or any learning stage. FreeOcc incrementally builds a globally consistent occupancy map via a four-layer pipeline: a SLAM backbone estimates poses and sparse geometry; a geometrically consistent Gaussian update constructs dense 3D Gaussian maps; open-vocabulary semantics from off-the-shelf vision-language models are associated with Gaussian primitives; and a probabilistic Gaussian-to-occupancy projection produces dense voxel occupancy. Despite being entirely training-free and requiring no ground-truth poses, FreeOcc more than doubles IoU and mIoU on EmbodiedOcc-ScanNet relative to prior self-supervised methods. We further introduce ReplicaOcc, a benchmark for indoor open-vocabulary occupancy prediction, and show that FreeOcc transfers zero-shot to novel environments, substantially outperforming both supervised and self-supervised baselines. Project page: https://the-masses.github.io/freeocc-web/.
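To make the final pipeline stage concrete, the following is a minimal sketch of one way a probabilistic Gaussian-to-occupancy projection could be written. The abstract does not give the formula, so everything here is an assumption for illustration: the function name `gaussians_to_occupancy`, its parameters, and in particular the choice to treat each Gaussian as an independent occupancy source, so that a voxel is free only if every Gaussian leaves it free. FreeOcc's actual projection may differ.

```python
import numpy as np

def gaussians_to_occupancy(means, covs, opacities, grid_min, voxel_size, grid_shape):
    """Hypothetical sketch: per-voxel occupancy probability from 3D Gaussians.

    Assumed model (not taken from the paper): Gaussian i makes a voxel center x
    occupied with probability alpha_i * exp(-0.5 * d^T inv(Sigma_i) d), where
    d = x - mu_i, and the Gaussians act independently.
    """
    # Integer voxel indices, shape (V, 3), then metric voxel centers.
    axes = [np.arange(n) for n in grid_shape]
    idx = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)
    centers = np.asarray(grid_min, dtype=float) + (idx + 0.5) * voxel_size

    # Running probability that no Gaussian occupies each voxel.
    free = np.ones(len(centers))
    for mu, cov, alpha in zip(means, covs, opacities):
        d = centers - mu
        # Squared Mahalanobis distance of every voxel center to this Gaussian.
        maha = np.einsum("vi,ij,vj->v", d, np.linalg.inv(cov), d)
        free *= 1.0 - alpha * np.exp(-0.5 * maha)

    return (1.0 - free).reshape(grid_shape)

# Toy usage: one Gaussian centered on the first voxel of a 4x4x4 grid.
means = np.array([[0.25, 0.25, 0.25]])
covs = np.array([0.01 * np.eye(3)])
opacities = np.array([0.9])
occ = gaussians_to_occupancy(means, covs, opacities,
                             grid_min=(0.0, 0.0, 0.0),
                             voxel_size=0.5,
                             grid_shape=(4, 4, 4))
```

Under this assumed model, a voxel whose center coincides with a Gaussian mean receives that Gaussian's opacity as its occupancy probability, and voxels far from all Gaussians stay near zero. The loop over Gaussians favors readability over speed; a practical system would restrict each Gaussian to nearby voxels.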