Existing machine learning research has achieved promising results in monaural audio-visual separation (MAVS). However, most MAVS methods consider only what the sound source is, not where it is located. This is a problem in VR/AR scenarios, where listeners need to distinguish between similar audio sources located in different directions. To address this limitation, we generalize MAVS to spatial audio separation and propose LAVSS: a location-guided audio-visual spatial audio separator. LAVSS is inspired by the correlation between spatial audio and visual location. We introduce the phase difference carried by binaural audio as a spatial cue, and we use positional representations of sounding objects as additional modality guidance. We also leverage multi-level cross-modal attention to perform visual-positional collaboration with audio features. In addition, we adopt a pre-trained monaural separator to transfer knowledge from rich mono sounds and boost spatial audio separation, exploiting the correlation between monaural and binaural channels. Experiments on the FAIR-Play dataset demonstrate the superiority of the proposed LAVSS over existing audio-visual separation baselines. Our project page: https://yyx666660.github.io/LAVSS/.
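To make the spatial cue concrete, the following is a minimal sketch of one common way to compute an inter-channel phase difference from a binaural signal. It is not taken from LAVSS: the abstract only states that the phase difference carried by binaural audio is used as a spatial cue. The function names `stft` and `interaural_phase_difference`, the Hann window, and the values of `n_fft`, `hop`, and the sample rate are all assumptions chosen for illustration.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Minimal Hann-windowed STFT (illustrative helper, not from the paper).

    Returns a complex array of shape (n_frames, n_fft // 2 + 1).
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)

def interaural_phase_difference(left, right, n_fft=512, hop=128):
    """Phase of the left channel minus phase of the right channel, per bin.

    The angle of the cross-spectrum L * conj(R) equals that difference,
    wrapped to the interval (-pi, pi].
    """
    L = stft(left, n_fft, hop)
    R = stft(right, n_fft, hop)
    return np.angle(L * np.conj(R))

# Toy binaural signal (assumed values): a 500 Hz tone that reaches the
# right ear 4 samples later than the left ear at a 16 kHz sample rate.
sr = 16000
freq = 500.0          # falls exactly on STFT bin 16 when n_fft = 512
delay = 4             # right-channel lag in samples
t = np.arange(sr) / sr
left = np.sin(2 * np.pi * freq * t)
right = np.sin(2 * np.pi * freq * (t - delay / sr))

ipd = interaural_phase_difference(left, right)
tone_bin = int(round(freq * 512 / sr))
expected = 2 * np.pi * freq * delay / sr   # analytic phase lag of the delayed tone
```

In this toy case the phase difference at the tone's frequency bin matches the analytic lag `2 * pi * freq * delay / sr`, which illustrates how a direction-dependent delay between the two ears shows up as a phase difference a separator can use.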