This paper focuses on few-shot sound event detection (SED), which aims to automatically recognize and classify sound events from a limited number of samples. Prevailing few-shot SED methods predominantly rely on segment-level predictions, which often fail to provide detailed, fine-grained predictions, particularly for events of brief duration. Although frame-level prediction strategies have been proposed to overcome this limitation, they commonly suffer from prediction truncation caused by background noise. To alleviate this issue, we introduce an innovative multitask frame-level SED framework. In addition, we introduce TimeFilterAug, a linear timing mask for data augmentation, to increase the model's robustness and adaptability to diverse acoustic environments. The proposed method achieves an F-score of 63.8%, ranking first in the few-shot bioacoustic event detection category of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2023.
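To make the truncation problem concrete, the following is a minimal sketch of how frame-level probabilities might be decoded into events by simple thresholding. It is not the proposed framework: the function name `frames_to_events`, the 0.5 threshold, the 20 ms hop, and the toy probability sequences are all assumptions chosen for illustration. The example shows that a single frame pushed below the threshold, as background noise might do, splits one event into two shorter ones.

```python
from typing import List, Tuple


def frames_to_events(probs: List[float], threshold: float = 0.5,
                     hop_ms: int = 20) -> List[Tuple[int, int]]:
    """Decode per-frame probabilities into (onset_ms, offset_ms) pairs.

    Hypothetical helper for illustration only; threshold and hop size
    are assumed values, not taken from the paper.
    """
    events = []
    start = None  # index of the first frame of the currently open event
    for i, p in enumerate(probs):
        active = p >= threshold
        if active and start is None:
            start = i  # event begins at this frame
        elif not active and start is not None:
            events.append((start * hop_ms, i * hop_ms))  # event ends here
            start = None
    if start is not None:
        # Close an event that is still open at the end of the clip.
        events.append((start * hop_ms, len(probs) * hop_ms))
    return events


# Toy sequences (assumed): the same event with and without a noisy frame.
clean = [0.1, 0.9, 0.9, 0.9, 0.9, 0.9, 0.1]
noisy = [0.1, 0.9, 0.9, 0.3, 0.9, 0.9, 0.1]

print(frames_to_events(clean))  # [(20, 120)]
print(frames_to_events(noisy))  # [(20, 60), (80, 120)]
```

In the noisy case the detector reports two truncated events where there is one, which is the kind of failure that frame-level decoding is exposed to under background noise.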