Vision-based 3D occupancy prediction is significantly challenged by the inherent limitations of monocular vision in depth estimation. This paper introduces CVT-Occ, a novel approach that leverages temporal fusion through the geometric correspondence of voxels over time to improve the accuracy of 3D occupancy predictions. By sampling points along the line of sight of each voxel and integrating the features of these points from historical frames, we construct a cost volume feature map that refines current volume features for improved prediction outcomes. Our method takes advantage of parallax cues from historical observations and employs a data-driven approach to learn the cost volume. We validate the effectiveness of CVT-Occ through rigorous experiments on the Occ3D-Waymo dataset, where it outperforms state-of-the-art methods in 3D occupancy prediction with minimal additional computational cost. The code is released at \url{https://github.com/Tsinghua-MARS-Lab/CVT-Occ}.
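The core construction described above — sampling points along each voxel's line of sight and gathering their features from a historical frame — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function name `build_cost_volume`, the normalized unit-cube coordinate frame, and the nearest-neighbor lookup (in place of trilinear interpolation) are all simplifying assumptions.

```python
import numpy as np

def build_cost_volume(curr_feats, hist_feats, voxel_centers, cam_origin,
                      num_samples=4):
    """Hypothetical sketch of a line-of-sight cost volume.

    curr_feats:    (N, C) current per-voxel features
    hist_feats:    (C, D, H, W) historical voxel feature volume over [0, 1]^3
    voxel_centers: (N, 3) voxel centers in normalized [0, 1]^3 coordinates
    cam_origin:    (3,) camera position in the same normalized frame
    """
    C, D, H, W = hist_feats.shape
    dims = np.array([D, H, W])
    # Fractions along the ray from the camera through each voxel center.
    ts = np.linspace(1.0 / num_samples, 1.0, num_samples)
    gathered = []
    for t in ts:
        # (N, 3) sample points along each voxel's line of sight.
        pts = cam_origin + t * (voxel_centers - cam_origin)
        # Nearest-neighbor cell indices (a real system would interpolate).
        idx = np.clip((pts * (dims - 1)).round().astype(int), 0, dims - 1)
        gathered.append(hist_feats[:, idx[:, 0], idx[:, 1], idx[:, 2]].T)  # (N, C)
    # Concatenate current features with the historical samples:
    # the result refines each voxel's representation with parallax cues.
    return np.concatenate([curr_feats] + gathered, axis=1)  # (N, C*(1+num_samples))
```

In the paper the fused cost volume feature is then consumed by a learned (data-driven) head rather than matched with a hand-crafted cost, which is what lets the network exploit parallax across historical frames.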