This paper describes sound event localization and detection (SELD) for spatial audio recordings captured by first-order ambisonics (FOA) microphones. In this task, one may train a deep neural network (DNN) using FOA data annotated with the classes and directions of arrival (DOAs) of sound events. However, the performance of this approach is severely bounded by the amount of annotated data. To overcome this limitation, we propose a novel method of pretraining the feature extraction part of the DNN in a self-supervised manner. We use spatial audio-visual recordings abundantly available as virtual reality content. Assuming that sound objects are concurrently observed by the FOA microphones and an omnidirectional camera, we jointly train audio and visual encoders with contrastive learning such that the audio and visual embeddings of the same recording and DOA are made close. A key feature of our method is that the DOA-wise audio embeddings are jointly extracted from the raw audio data, while the DOA-wise visual embeddings are separately extracted from the local visual crops centered on the corresponding DOAs. This encourages the latent features of the audio encoder to represent both the classes and DOAs of sound events. An experiment using the 20-hour DCASE2022 Task 3 dataset shows that pretraining on 100 hours of non-annotated audio-visual recordings reduced the SELD error score from 36.4 points to 34.9 points.
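The following is a minimal sketch, not the authors' implementation, of the DOA-wise audio-visual contrastive objective described above, written in PyTorch. The embedding dimension, DOA grid size, and function name are hypothetical placeholders; only the pairing scheme follows the text: audio embeddings for all DOAs are produced jointly from the FOA input, while each visual embedding comes from a local crop centered on the corresponding DOA, and embeddings of the same recording and DOA are pulled together.

```python
# Illustrative sketch only; encoder outputs are assumed to already exist.
import torch
import torch.nn.functional as F


def doa_contrastive_loss(audio_emb, visual_emb, temperature=0.07):
    """InfoNCE-style loss over (recording, DOA) pairs.

    audio_emb:  (B, D, C) tensor -- B recordings, D DOA grid points,
                C-dimensional embeddings from the audio encoder.
    visual_emb: (B, D, C) tensor -- embeddings of the visual crops
                centered on the corresponding DOAs.
    Embeddings of the same recording and DOA form positive pairs;
    all other (recording, DOA) combinations act as negatives.
    """
    B, D, C = audio_emb.shape
    a = F.normalize(audio_emb.reshape(B * D, C), dim=-1)
    v = F.normalize(visual_emb.reshape(B * D, C), dim=-1)
    logits = a @ v.t() / temperature          # (B*D, B*D) similarity matrix
    targets = torch.arange(B * D, device=logits.device)
    # Symmetric cross-entropy: audio-to-visual and visual-to-audio.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    # Random tensors stand in for encoder outputs (sizes are assumptions).
    B, D, C = 4, 13, 128
    loss = doa_contrastive_loss(torch.randn(B, D, C), torch.randn(B, D, C))
    print(loss.item())
```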