Augmented Reality (AR) devices, emerging as prominent mobile interaction platforms, face challenges in user safety, particularly concerning oncoming vehicles. While some solutions leverage onboard camera arrays, these cameras often have a limited field of view (FoV), covering only front or downward perspectives. To address this, we propose a new out-of-view semantic segmentation task and Segment Beyond View (SBV), a novel audio-visual semantic segmentation method. SBV supplements the visual modality, which misses information beyond the FoV, with auditory information using a teacher-student distillation model (Omni2Ego). The model consists of a vision teacher utilising panoramic information, an auditory teacher with 8-channel audio, and an audio-visual student that takes limited-FoV views and binaural audio as input and produces semantic segmentation for objects outside the FoV. SBV outperforms existing models in comparative evaluations and shows consistent performance across varying FoV ranges and in monaural audio settings.
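The two-teacher, one-student arrangement described above can be illustrated with a generic knowledge-distillation objective. The sketch below is not the Omni2Ego loss, which the abstract does not specify; it assumes a temperature-softened KL divergence between each teacher's per-pixel class distribution and the student's, combined with a hypothetical weight `vision_weight`. All function names and parameters here are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over one pixel's class logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(student, vision_teacher, audio_teacher,
                      temperature=2.0, vision_weight=0.5):
    """Mean per-pixel distillation loss (assumed form, not the paper's).

    Each argument is a list of pixels; each pixel is a list of class logits.
    The student is pulled towards both teachers, weighted by vision_weight.
    """
    total = 0.0
    for s, v, a in zip(student, vision_teacher, audio_teacher):
        p_student = softmax(s, temperature)
        p_vision = softmax(v, temperature)
        p_audio = softmax(a, temperature)
        total += vision_weight * kl_divergence(p_vision, p_student)
        total += (1.0 - vision_weight) * kl_divergence(p_audio, p_student)
    return total / len(student)


# Toy example: two pixels, two classes (e.g. background vs. vehicle).
teacher_v = [[2.0, 0.0], [0.0, 2.0]]
teacher_a = [[1.5, 0.5], [0.5, 1.5]]
student_bad = [[0.0, 2.0], [2.0, 0.0]]   # disagrees with both teachers
student_good = [[2.0, 0.0], [0.0, 2.0]]  # matches the vision teacher

loss_bad = distillation_loss(student_bad, teacher_v, teacher_a)
loss_good = distillation_loss(student_good, teacher_v, teacher_a)
```

In this toy setting the student that agrees with the teachers receives the lower loss. A practical system would operate on tensors in a deep-learning framework and would typically add a supervised segmentation term on ground-truth labels; both are omitted here for brevity.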