Sound event localization and detection (SELD) combines two subtasks: sound event detection (SED) and direction of arrival (DOA) estimation. SELD is usually tackled as an audio-only problem, but visual information has recently been included. Few audio-visual (AV)-SELD works have been published, and most exploit vision through face or object bounding boxes or human pose keypoints. In contrast, we explore the integration of audio and visual feature embeddings extracted with pre-trained deep networks. For the visual modality, we tested ResNet50 and Inflated 3D ConvNet (I3D). Our comparison of AV fusion methods includes the AV-Conformer and the Cross-Modal Attentive Fusion (CMAF) model. Our best models outperform the DCASE 2023 Task 3 audio-only and AV baselines by a wide margin on the development set of the STARSS23 dataset, making them competitive with state-of-the-art results in the AV challenge without model ensembling, heavy data augmentation, or prediction post-processing. Such techniques, together with further pre-training, could be applied as next steps to improve performance.