In a typical sound event detection (SED) system, the presence of a sound event is detected at the frame level, and consecutive frames in which the same event is detected are merged into a single sound event. A median filter is commonly applied as a post-processing step to remove as many detection errors as possible. However, detection errors that occur around the onset and offset of a sound event cannot be corrected by the median filter. To address this issue, this paper proposes an onset and offset weighted binary cross-entropy (OWBCE) loss function, which trains the DNN model to be more robust on frames around onsets and offsets. Experiments are carried out in the context of DCASE 2022 Task 4. Results show that OWBCE outperforms BCE across the different models considered. For a basic CRNN, OWBCE achieves relative improvements of 6.43% in event-F1, 1.96% in PSDS1, and 2.43% in PSDS2.
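The post-processing pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the filter width of 3, the replicate padding at the edges, and the one-second hop are all assumptions chosen for illustration. The example shows a median filter removing an isolated false detection while leaving a late onset uncorrected.

```python
from statistics import median


def median_filter(frames, width=3):
    """Sliding median over a binary frame sequence.

    `width` is assumed odd; edges are padded by repeating the first
    and last frame (an illustrative choice, not taken from the paper).
    """
    half = width // 2
    padded = [frames[0]] * half + list(frames) + [frames[-1]] * half
    return [int(median(padded[i:i + width])) for i in range(len(frames))]


def frames_to_events(frames, hop_seconds=1.0):
    """Merge runs of consecutive active frames into (onset, offset) pairs."""
    events = []
    start = None
    for i, active in enumerate(frames):
        if active and start is None:
            start = i  # a run of active frames begins
        elif not active and start is not None:
            events.append((start * hop_seconds, i * hop_seconds))
            start = None
    if start is not None:  # run extends to the end of the clip
        events.append((start * hop_seconds, len(frames) * hop_seconds))
    return events


# Hypothetical frame decisions: the true event spans frames 4 to 8,
# frame 1 is a spurious detection, and the onset is detected one frame late.
predicted = [0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
smoothed = median_filter(predicted, width=3)
events = frames_to_events(smoothed)
```

In this toy input the spurious detection at frame 1 disappears after filtering, but the merged event still starts at frame 5 rather than frame 4, which is the kind of boundary error the abstract refers to.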
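The abstract does not state the exact form of the OWBCE weighting, so the following is only a sketch of the general idea under stated assumptions: frames within a hypothetical `radius` of a label transition receive a hypothetical `boundary_weight`, all other frames receive weight 1, and the weighted per-frame BCE is averaged over the clip. It handles a single event class; a multi-class system would presumably apply the same weighting per class.

```python
import math


def boundary_weights(labels, radius=1, boundary_weight=2.0):
    """Per-frame weights that emphasise frames near onsets and offsets.

    Assumed scheme (not the paper's formula): frames within `radius`
    of a 0->1 or 1->0 transition get `boundary_weight`, others 1.0.
    """
    n = len(labels)
    weights = [1.0] * n
    for i in range(1, n):
        if labels[i] != labels[i - 1]:  # transition between frames i-1 and i
            for j in range(max(0, i - radius), min(n, i + radius)):
                weights[j] = boundary_weight
    return weights


def weighted_bce(probs, labels, weights, eps=1e-7):
    """Mean of per-frame binary cross-entropy scaled by `weights`."""
    total = 0.0
    for p, y, w in zip(probs, labels, weights):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


# Hypothetical clip: one event covering frames 2 to 4.
labels = [0, 0, 1, 1, 1, 0]
weights = boundary_weights(labels)

# The same mistake (p = 0.1 on an active frame) at the onset and mid-event.
error_at_onset = [0.1, 0.1, 0.1, 0.9, 0.9, 0.1]
error_mid_event = [0.1, 0.1, 0.9, 0.1, 0.9, 0.1]
loss_onset = weighted_bce(error_at_onset, labels, weights)
loss_mid = weighted_bce(error_mid_event, labels, weights)
```

With all weights set to 1.0 the function reduces to plain BCE. Under the assumed scheme, the same misclassification is penalised more heavily at the onset frame than in the middle of the event, which pushes training toward the boundary frames that the median filter cannot repair.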