Semantic scene completion (SSC) aims to predict complete 3D voxel occupancy and semantics from a single-view RGB-D image, and recent SSC methods commonly adopt multi-modal inputs. However, our investigation reveals two limitations of these methods: ineffective feature learning from each single modality, and overfitting to limited datasets. To address these issues, this paper proposes a novel SSC framework, the Adversarial Modality Modulation Network (AMMNet), which approaches the problem from the perspective of optimizing gradient updates. AMMNet introduces two core modules: a cross-modal modulation that enables interdependence of gradient flows between modalities, and a customized adversarial training scheme that leverages dynamic gradient competition. Specifically, the cross-modal modulation adaptively re-calibrates features to better exploit the representation potential of each single modality. The adversarial training employs a minimax game of evolving gradients, with customized guidance that strengthens the generator's perception of visual fidelity in terms of both geometric completeness and semantic correctness. Extensive experimental results demonstrate that AMMNet outperforms state-of-the-art SSC methods by a large margin, providing a promising direction for improving the effectiveness and generalization of SSC methods.
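To make the idea of cross-modal re-calibration concrete, the following is a minimal sketch and not the AMMNet module itself. The abstract does not specify the mechanism, so the sketch assumes a squeeze-and-excitation-style channel gate: the features of one modality are rescaled by per-channel gates computed from the other modality. The function name `cross_modal_recalibrate`, the linear gating layer (`weights`, `bias`), and the flat per-channel feature layout are all hypothetical choices made for illustration.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def cross_modal_recalibrate(feat_a, feat_b, weights, bias):
    """Rescale the channels of feat_a with gates computed from feat_b.

    feat_a, feat_b: lists of C channels, each a flat list of voxel activations.
    weights: C x C matrix and bias: length-C vector of a hypothetical
    linear gating layer (an assumption, not taken from the paper).
    """
    # Squeeze: one summary value per channel of the *other* modality.
    context = [sum(channel) / len(channel) for channel in feat_b]
    # Excite: a per-channel gate in (0, 1) from the summarized context.
    gates = [
        sigmoid(sum(w * c for w, c in zip(row, context)) + b)
        for row, b in zip(weights, bias)
    ]
    # Re-calibrate: scale every voxel of each channel by its gate.
    return [[g * v for v in channel] for g, channel in zip(gates, feat_a)]


# Toy features: 2 channels, 2 voxels each (values chosen for illustration).
rgb_feat = [[1.0, 2.0], [3.0, 4.0]]
depth_feat = [[0.0, 0.0], [2.0, 2.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
zero_bias = [0.0, 0.0]

rgb_mod = cross_modal_recalibrate(rgb_feat, depth_feat, identity, zero_bias)
depth_mod = cross_modal_recalibrate(depth_feat, rgb_feat, identity, zero_bias)
```

Because each output is a product of one modality's features and a gate derived from the other, an automatic-differentiation framework would propagate the loss gradient of either output into both branches. This is one simple way in which gradient flows between modalities can become interdependent, under the assumptions stated above.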