Immersive communication has made significant advancements, especially with the release of the codec for Immersive Voice and Audio Services. To advance this direction further, the DCASE 2025 Challenge has recently introduced a task for spatial semantic segmentation of sound scenes (S5), which focuses on detecting and separating sound events in spatial sound scenes. In this paper, we explore methods for addressing the S5 task. Specifically, we present baseline S5 systems that combine audio tagging (AT) and label-queried source separation (LSS) models. We investigate two LSS approaches based on the ResUNet architecture: a) extracting a single source for each detected event and b) querying multiple sources concurrently. Since each separated source in S5 is identified by its sound event class label, we propose new class-aware metrics to evaluate both the sound sources and labels simultaneously. Experimental results on first-order ambisonics spatial audio demonstrate the effectiveness of the proposed systems and confirm the efficacy of the metrics.
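To make the idea of a class-aware metric concrete, the following is a minimal sketch, not the paper's actual formulation: each separated source carries a predicted class label, so a reference source is scored (here with a simplified SDR) only when a separated source with the matching label exists, while missed or spurious labels receive a fixed penalty. The `penalty` value and the label-matching scheme are illustrative assumptions.

```python
import math

def sdr_db(ref, est):
    """Simplified signal-to-distortion ratio in dB (no scaling allowance)."""
    num = sum(r * r for r in ref)
    den = sum((r - e) ** 2 for r, e in zip(ref, est))
    return 10.0 * math.log10(num / max(den, 1e-12))

def class_aware_score(refs, ests, penalty=-10.0):
    """Hypothetical class-aware score.

    refs/ests: dicts mapping class label -> waveform (list of floats).
    A separated source contributes its SDR only if its predicted label
    matches a reference label; a missed reference or a spurious
    prediction contributes a fixed penalty (an assumed design choice).
    """
    scores = []
    for label, ref in refs.items():
        if label in ests:
            scores.append(sdr_db(ref, ests[label]))   # matched label: score the audio
        else:
            scores.append(penalty)                    # missed event
    for label in ests:
        if label not in refs:
            scores.append(penalty)                    # spurious / mislabeled event
    return sum(scores) / len(scores) if scores else 0.0
```

The key property this illustrates is that a perfectly separated waveform with the wrong label still scores poorly, coupling separation quality and label accuracy in a single number.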