Sound can convey significant information for spatial reasoning in our daily lives. To endow deep networks with such ability, we address the challenge of dense indoor prediction with sound in both 2D and 3D via cross-modal knowledge distillation. In this work, we propose a Spatial Alignment via Matching (SAM) distillation framework that elicits local correspondence between the two modalities in vision-to-audio knowledge transfer. SAM integrates audio features with visually coherent learnable spatial embeddings to resolve inconsistencies in multiple layers of a student model. Our approach does not rely on a specific input representation, allowing for flexibility in the input shapes or dimensions without performance degradation. With a newly curated benchmark named Dense Auditory Prediction of Surroundings (DAPS), we are the first to tackle dense indoor prediction of omnidirectional surroundings in both 2D and 3D with audio observations. Specifically, for audio-based depth estimation, semantic segmentation, and challenging 3D scene reconstruction, the proposed distillation framework consistently achieves state-of-the-art performance across various metrics and backbone architectures.
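To make the idea of feature-level vision-to-audio distillation with spatial embeddings concrete, the following is a minimal sketch and not the SAM procedure itself, whose matching step is not specified in the abstract. It assumes that teacher (visual) and student (audio) feature maps share a shape of `[H][W][C]`, that the spatial embedding is simply added per location, and that the distillation signal is a mean squared error. All function names, shapes, and values here are hypothetical.

```python
import random


def add_spatial_embedding(audio_feat, spatial_emb):
    """Add a per-location embedding to a student feature map of shape [H][W][C]."""
    return [
        [
            [f + e for f, e in zip(feat_vec, emb_vec)]
            for feat_vec, emb_vec in zip(feat_row, emb_row)
        ]
        for feat_row, emb_row in zip(audio_feat, spatial_emb)
    ]


def feature_distillation_loss(student_feat, teacher_feat):
    """Mean squared error between two feature maps of identical shape [H][W][C]."""
    total, count = 0.0, 0
    for s_row, t_row in zip(student_feat, teacher_feat):
        for s_vec, t_vec in zip(s_row, t_row):
            for s, t in zip(s_vec, t_vec):
                total += (s - t) ** 2
                count += 1
    return total / count


def multi_layer_loss(layer_pairs):
    """Sum the distillation loss over (student, teacher) pairs, one per layer."""
    return sum(feature_distillation_loss(s, t) for s, t in layer_pairs)


def random_map(h, w, c, rng):
    """Random stand-in for a feature map; real features would come from a network."""
    return [[[rng.gauss(0.0, 1.0) for _ in range(c)] for _ in range(w)] for _ in range(h)]


rng = random.Random(0)
H, W, C = 4, 8, 3  # hypothetical feature-map size

teacher = random_map(H, W, C, rng)  # stand-in for visual teacher features
audio = random_map(H, W, C, rng)    # stand-in for audio student features

# Untrained embedding: all zeros, so the student features are unchanged.
zero_emb = [[[0.0] * C for _ in range(W)] for _ in range(H)]

# Embedding that exactly closes the gap at every location. In training, the
# embedding would instead be learned by gradient descent on the loss.
ideal_emb = [
    [[t - a for t, a in zip(t_vec, a_vec)] for t_vec, a_vec in zip(t_row, a_row)]
    for t_row, a_row in zip(teacher, audio)
]

loss_before = feature_distillation_loss(add_spatial_embedding(audio, zero_emb), teacher)
loss_after = feature_distillation_loss(add_spatial_embedding(audio, ideal_emb), teacher)
print(f"loss with zero embedding:  {loss_before:.4f}")
print(f"loss with ideal embedding: {loss_after:.4f}")
```

The contrast between the two losses only illustrates why a location-dependent embedding can absorb a spatial mismatch between modalities; a real student would learn the embedding jointly with the audio backbone and apply `multi_layer_loss` across several layers.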