Augmented Reality (AR) devices, emerging as prominent mobile interaction platforms, face challenges in user safety, particularly concerning oncoming vehicles. While some solutions leverage onboard camera arrays, these cameras often have a limited field of view (FoV) with front or downward perspectives. To address this, we propose a new out-of-view semantic segmentation task and Segment Beyond View (SBV), a novel audio-visual semantic segmentation method. SBV supplements the visual modality, which misses information beyond the FoV, with auditory information using a teacher-student distillation model (Omni2Ego). The model consists of a vision teacher utilising panoramic information, an auditory teacher with 8-channel audio, and an audio-visual student that takes limited-FoV views and binaural audio as input and produces semantic segmentation for objects outside the FoV. SBV outperforms existing models in comparative evaluations and shows consistent performance across varying FoV ranges and in monaural audio settings.
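The abstract names a two-teacher, one-student distillation setup but does not state the training objective. As a rough illustration only, the sketch below shows a generic temperature-softened knowledge-distillation loss in which a student's per-pixel class predictions are pulled toward those of a vision teacher and an auditory teacher. Everything here is an assumption made for illustration: the function names, the KL-divergence form, the temperature, and the equal teacher weights are hypothetical and are not taken from Omni2Ego.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one pixel's class logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(student_logits, vision_teacher_logits, audio_teacher_logits,
                      temperature=2.0, vision_weight=0.5, audio_weight=0.5):
    """Mean per-pixel distillation loss against two teachers.

    Each argument is a list of pixels; each pixel is a list of class logits.
    The weights and temperature are illustrative defaults (assumptions),
    not values reported for SBV.
    """
    total = 0.0
    for s, v, a in zip(student_logits, vision_teacher_logits, audio_teacher_logits):
        p_student = softmax(s, temperature)
        p_vision = softmax(v, temperature)
        p_audio = softmax(a, temperature)
        total += vision_weight * kl_divergence(p_vision, p_student)
        total += audio_weight * kl_divergence(p_audio, p_student)
    return total / len(student_logits)


# Toy example: two pixels, three classes (e.g. background, car, bus).
vision_teacher = [[4.0, 0.5, 0.1], [0.2, 3.5, 0.3]]
audio_teacher = [[3.0, 1.0, 0.2], [0.1, 2.5, 1.0]]
student = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]  # uninformed student

loss_uninformed = distillation_loss(student, vision_teacher, audio_teacher)
loss_matching = distillation_loss(vision_teacher, vision_teacher, vision_teacher)
```

In this toy setup the loss is zero when the student reproduces both teachers exactly and positive otherwise. In practice such a term would normally be combined with a supervised segmentation loss on ground-truth masks; whether and how SBV does so is not stated in the abstract.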