Semantic scene completion (SSC) aims to predict complete 3D voxel occupancy and semantics from a single-view RGB-D image, and recent SSC methods commonly adopt multi-modal inputs. However, our investigation reveals two limitations: ineffective feature learning from each single modality and overfitting to limited datasets. To address these issues, this paper proposes a novel SSC framework, the Adversarial Modality Modulation Network (AMMNet), which approaches the problem from the perspective of optimizing gradient updates. AMMNet introduces two core modules: a cross-modal modulation module that makes the gradient flows of the modalities interdependent, and a customized adversarial training scheme that leverages dynamic gradient competition. Specifically, the cross-modal modulation adaptively re-calibrates features to better exploit the representational potential of each modality. The adversarial training plays a minimax game with evolving gradients, using customized guidance to strengthen the generator's perception of visual fidelity in terms of both geometric completeness and semantic correctness. Extensive experimental results demonstrate that AMMNet outperforms state-of-the-art SSC methods by a large margin, providing a promising direction for improving the effectiveness and generalization of SSC methods.
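The abstract does not specify how the two modules are built, so the following is only a minimal NumPy sketch of the general ideas, not the AMMNet implementation. It assumes that cross-modal modulation can be illustrated as a channel-wise scale and shift of one modality's features computed from the other modality, and that the minimax game can be illustrated with the standard GAN losses. All names here (`cross_modal_modulate`, `adversarial_losses`, `w_scale`, `w_shift`) are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def cross_modal_modulate(target, source, w_scale, w_shift):
    """Re-calibrate `target` features using channel statistics of `source`.

    target, source: feature volumes of shape (C, D, H, W).
    w_scale, w_shift: hypothetical (C, C) projection matrices (assumed,
    not taken from the paper).
    """
    # Global average pooling: one descriptor value per source channel.
    descriptor = source.mean(axis=(1, 2, 3))   # shape (C,)
    scale = sigmoid(w_scale @ descriptor)      # shape (C,), values in (0, 1)
    shift = w_shift @ descriptor               # shape (C,)
    # Broadcast the per-channel scale and shift over the spatial dimensions.
    return target * scale[:, None, None, None] + shift[:, None, None, None]


def adversarial_losses(d_real, d_fake, eps=1e-8):
    """Standard GAN losses from discriminator output probabilities.

    d_real: discriminator outputs on ground-truth scenes, in [0, 1].
    d_fake: discriminator outputs on generated scenes, in [0, 1].
    Returns (discriminator loss, non-saturating generator loss).
    """
    d_loss = -np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps))
    g_loss = -np.mean(np.log(d_fake + eps))
    return d_loss, g_loss


# Toy usage with random features standing in for RGB and depth branches.
rng = np.random.default_rng(0)
rgb_feat = rng.normal(size=(4, 2, 3, 3))
depth_feat = rng.normal(size=(4, 2, 3, 3))
w_scale = rng.normal(size=(4, 4))
w_shift = rng.normal(size=(4, 4))
rgb_modulated = cross_modal_modulate(rgb_feat, depth_feat, w_scale, w_shift)
```

Because the scale and shift applied to one branch depend on the other branch, a loss on the modulated features sends gradients to both modalities, which is one simple way to read the "interdependence of gradient flows" described above.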