While the direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded with a microphone array, sound events usually originate from visually perceptible source objects; e.g., the sound of footsteps comes from the feet of a walker. This paper proposes an audio-visual sound event localization and detection (SELD) task, which uses multichannel audio and video information to estimate the temporal activation and DOA of target sound events. Audio-visual SELD systems can detect and localize sound events using both signals from a microphone array and audio-visual correspondence. We also introduce an audio-visual dataset, Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23), which consists of multichannel audio data recorded with a microphone array, video data, and spatiotemporal annotations of sound events. Sound scenes in STARSS23 are recorded following instructions that guide the participants to ensure adequate activity and occurrences of sound events. STARSS23 also provides human-annotated temporal activation labels and human-confirmed DOA labels, which are based on tracking results of a motion capture system. Our benchmark results demonstrate the benefits of using visual object positions in audio-visual SELD tasks. The data is available at https://zenodo.org/record/7880637.