Sound Event Detection (SED) aims to predict the temporal boundaries and class labels of all events of interest in an unconstrained audio sample. Whether they take the split-and-classify (i.e., frame-level) strategy or the more principled event-level modeling approach, all existing methods treat SED as a discriminative learning problem. In this work, we reformulate SED from a generative learning perspective. Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process, conditioned on a target audio sample. During training, our model learns to reverse the noising process by converting noisy latent queries to their ground-truth versions within a Transformer decoder framework. This enables the model to generate accurate event boundaries even from noisy queries during inference. Extensive experiments on the Urban-SED and EPIC-Sounds datasets demonstrate that our model significantly outperforms existing alternatives, while converging more than 40% faster in training.
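To make the noising step concrete, here is a minimal sketch of a standard DDPM-style forward process applied to ground-truth event boundaries. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the noise schedule, the boundary parameterization, or the signal scale, so the cosine schedule, the normalized `(start, end)` pairs, and the names `cosine_alpha_bar`, `noise_boundaries`, and `scale` are all hypothetical choices. The abstract also describes noising latent queries, whereas this sketch noises the boundary values directly for simplicity.

```python
import math
import random


def cosine_alpha_bar(t, num_steps, s=0.008):
    """Cumulative signal level alpha_bar_t under a cosine schedule (an assumed choice)."""
    def f(u):
        return math.cos((u / num_steps + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)


def noise_boundaries(events, t, num_steps=1000, scale=2.0, rng=random):
    """Forward-diffuse (start, end) pairs, normalized to [0, 1], to step t.

    Applies x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
    to each boundary independently, with eps drawn from N(0, 1).
    """
    a_bar = cosine_alpha_bar(t, num_steps)
    noisy = []
    for start, end in events:
        pair = []
        for x in (start, end):
            x0 = (2.0 * x - 1.0) * scale           # map [0, 1] to [-scale, scale]
            eps = rng.gauss(0.0, 1.0)              # standard Gaussian noise
            xt = math.sqrt(a_bar) * x0 + math.sqrt(1.0 - a_bar) * eps
            xt = max(-scale, min(scale, xt))       # clamp to the valid range
            pair.append((xt / scale + 1.0) / 2.0)  # map back to [0, 1]
        # Reordering keeps start <= end; this is a convenience of the sketch.
        noisy.append((min(pair), max(pair)))
    return noisy


if __name__ == "__main__":
    ground_truth = [(0.20, 0.50), (0.60, 0.90)]  # hypothetical normalized events
    for step in (0, 250, 1000):
        print(step, noise_boundaries(ground_truth, step))
```

At `t = 0` the boundaries are returned unchanged, and as `t` approaches `num_steps` they become nearly pure noise. A denoising model of the kind described above would be trained to map such noisy proposals, conditioned on audio features, back to the ground-truth boundaries.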