Camera-based 3D semantic scene completion (SSC) is pivotal for predicting complicated 3D layouts from limited 2D image observations. Existing mainstream solutions generally leverage temporal information by simply stacking history frames to supplement the current frame; such straightforward temporal modeling inevitably diminishes valid clues and increases learning difficulty. To address this problem, we present HTCL, a novel Hierarchical Temporal Context Learning paradigm for improving camera-based semantic scene completion. The primary innovation of this work is the decomposition of temporal context learning into two hierarchical steps: (a) cross-frame affinity measurement and (b) affinity-based dynamic refinement. First, to separate critical relevant context from redundant information, we introduce pattern affinity with scale-aware isolation and multiple independent learners for fine-grained contextual correspondence modeling. Subsequently, to dynamically compensate for incomplete observations, we adaptively refine the feature sampling locations based on the initially identified high-affinity locations and their neighboring relevant regions. Our method ranks $1^{st}$ on the SemanticKITTI benchmark and even surpasses LiDAR-based methods in terms of mIoU on the OpenOccupancy benchmark. Our code is available at https://github.com/Arlo0o/HTCL.
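The two-step idea of measuring cross-frame affinity and then refining features from high-affinity locations can be illustrated with a toy NumPy sketch. This is not the HTCL implementation: the function names `cross_frame_affinity` and `affinity_refine`, the use of plain cosine similarity, and the top-k softmax aggregation are all assumptions made for illustration, standing in for the paper's scale-aware pattern affinity and its adaptive refinement of sampling locations.

```python
import numpy as np

def cross_frame_affinity(cur, hist):
    """Cosine similarity between current-frame and history-frame features.

    cur:  (N, C) features of the current frame (hypothetical layout)
    hist: (M, C) features of a history frame
    returns an (N, M) affinity matrix
    """
    cur_n = cur / (np.linalg.norm(cur, axis=1, keepdims=True) + 1e-8)
    hist_n = hist / (np.linalg.norm(hist, axis=1, keepdims=True) + 1e-8)
    return cur_n @ hist_n.T

def affinity_refine(cur, hist, k=2):
    """Add affinity-weighted history context to each current feature.

    For every current feature, keep only the k history features with the
    highest affinity and blend them with softmax weights. This is a
    simplification: it selects existing locations rather than learning
    offsets around them.
    """
    aff = cross_frame_affinity(cur, hist)                 # (N, M)
    top_idx = np.argsort(-aff, axis=1)[:, :k]             # (N, k)
    top_aff = np.take_along_axis(aff, top_idx, axis=1)    # (N, k)
    w = np.exp(top_aff - top_aff.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                     # softmax over k
    gathered = hist[top_idx]                              # (N, k, C)
    context = (w[..., None] * gathered).sum(axis=1)       # (N, C)
    return cur + context

if __name__ == "__main__":
    cur = np.array([[1.0, 0.0], [0.0, 1.0]])
    hist = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])
    print(cross_frame_affinity(cur, hist).round(3))
    print(affinity_refine(cur, hist, k=1))
```

Restricting aggregation to the top-k entries mirrors the stated motivation of separating relevant context from redundant history information, as opposed to stacking all history features uniformly.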