Sound event localization and detection with distance estimation (3D SELD) in video requires identifying the sound events active in each time frame and estimating their spatial coordinates. This multimodal task demands joint reasoning across semantic, spatial, and temporal dimensions, which a single model often struggles to do well. To address this, we introduce the Team of Specialists (ToS) ensemble framework, which integrates three complementary sub-networks: a spatio-linguistic model, a spatio-temporal model, and a tempo-linguistic model. Each sub-network specializes in a distinct pair of dimensions and contributes complementary information to the final prediction, much like a collaborative team with diverse expertise. We benchmark ToS against state-of-the-art audio-visual models for 3D SELD on the DCASE2025 Task 3 Stereo SELD development set, where it consistently outperforms existing methods across key metrics. Future work will extend this proof of concept by strengthening the specialists with appropriate tasks, training, and pre-training curricula.
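As an illustration of how an ensemble of three specialists could contribute to a single per-frame prediction, the following is a minimal sketch only. The abstract does not state the fusion rule used by ToS, so the activity-weighted averaging here is an assumption, as are the `Prediction` layout, the `fuse_frame` function, the Cartesian `(x, y)` coordinates, and the 0.5 threshold; all are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """One specialist's output for one class in one time frame (hypothetical layout)."""
    activity: float  # estimated probability in [0, 1] that the class is active
    x: float         # estimated source position, first coordinate
    y: float         # estimated source position, second coordinate


def fuse_frame(specialists, threshold=0.5):
    """Fuse per-class predictions from several specialists for a single frame.

    specialists: list of dicts mapping class name -> Prediction.
    A class missing from a specialist's dict counts as activity 0 for that specialist.
    Returns a dict with one fused Prediction per class whose mean activity
    reaches the threshold. This averaging rule is an assumption, not the
    method described in the source.
    """
    if not specialists:
        return {}
    fused = {}
    classes = set().union(*(s.keys() for s in specialists))
    for name in sorted(classes):
        preds = [s[name] for s in specialists if name in s]
        total = sum(p.activity for p in preds)
        mean_activity = total / len(specialists)
        if total == 0 or mean_activity < threshold:
            continue  # the ensemble treats this class as inactive
        # Position: average of the specialists' estimates, weighted by activity.
        x = sum(p.activity * p.x for p in preds) / total
        y = sum(p.activity * p.y for p in preds) / total
        fused[name] = Prediction(mean_activity, x, y)
    return fused
```

Weighting positions by activity lets a specialist that is confident an event is present have more say in where it is placed. A learned fusion layer would be an equally plausible alternative under the same description.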