Camera-based 3D semantic scene completion (SSC) is pivotal for predicting complex 3D layouts from limited 2D image observations. Existing mainstream solutions generally leverage temporal information by roughly stacking history frames to supplement the current frame; such straightforward temporal modeling inevitably diminishes valid clues and increases learning difficulty. To address this problem, we present HTCL, a novel Hierarchical Temporal Context Learning paradigm for improving camera-based semantic scene completion. The primary innovation of this work is to decompose temporal context learning into two hierarchical steps: (a) cross-frame affinity measurement and (b) affinity-based dynamic refinement. First, to separate critical relevant context from redundant information, we introduce pattern affinity with scale-aware isolation and multiple independent learners for fine-grained contextual correspondence modeling. Subsequently, to dynamically compensate for incomplete observations, we adaptively refine the feature sampling locations based on the initially identified high-affinity locations and their neighboring relevant regions. Our method ranks $1^{st}$ on the SemanticKITTI benchmark and even surpasses LiDAR-based methods in terms of mIoU on the OpenOccupancy benchmark. Our code is available at https://github.com/Arlo0o/HTCL.
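The two hierarchical steps can be illustrated with a deliberately small sketch. It is not the HTCL implementation: it assumes a 1-D row of history-frame features, uses plain cosine similarity in place of the learned pattern affinity, and replaces the learned refinement with a softmax-weighted average over a fixed neighborhood. All function names and the `radius` and `temperature` parameters are hypothetical and chosen only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def cross_frame_affinity(query, history):
    """Step (a): affinity of one current-frame feature to every history location.
    Assumption: cosine similarity stands in for the learned pattern affinity."""
    return [cosine(query, h) for h in history]

def refine_location(affinity, radius=1, temperature=0.1):
    """Step (b): start at the highest-affinity index, then shift the sampling
    location toward relevant neighbors via a softmax-weighted average.
    Assumption: a fixed window replaces the learned, adaptive refinement."""
    peak = max(range(len(affinity)), key=affinity.__getitem__)
    lo = max(0, peak - radius)
    hi = min(len(affinity) - 1, peak + radius)
    idx = list(range(lo, hi + 1))
    # Subtract the peak affinity before exponentiating for numerical stability.
    weights = [math.exp((affinity[i] - affinity[peak]) / temperature) for i in idx]
    total = sum(weights)
    return sum(i * w for i, w in zip(idx, weights)) / total

def sample(history, pos):
    """Linearly interpolate history features at a fractional location."""
    left = int(math.floor(pos))
    right = min(left + 1, len(history) - 1)
    t = pos - left
    return [(1 - t) * a + t * b for a, b in zip(history[left], history[right])]

# Toy data: four history-frame feature vectors and one current-frame query.
history = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]
query = [0.6, 0.8]

affinity = cross_frame_affinity(query, history)   # step (a)
position = refine_location(affinity)              # step (b)
feature = sample(history, position)               # refined temporal feature
```

In this toy setup the best initial match is the second history location, and the refined sampling position moves fractionally toward the third location because that neighbor also has high affinity, while the low-affinity first location contributes almost nothing.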