Frequency dynamic convolution (FDY conv) has shown state-of-the-art performance in sound event detection (SED) by using frequency-adaptive kernels obtained from a frequency-varying combination of basis kernels. However, FDY conv lacks an explicit means of diversifying its frequency-adaptive kernels, which potentially limits performance. In addition, the size of the basis kernels is limited, whereas time-frequency patterns span a larger spectro-temporal range. We therefore propose dilated frequency dynamic convolution (DFD conv), which diversifies and expands the frequency-adaptive kernels by applying different dilation sizes to the basis kernels. Experiments show the advantage of varying dilation sizes along the frequency dimension, and an analysis of attention weight variance shows that the dilated basis kernels are effectively diversified. By adapting a class-wise median filter with the intersection-based F1 score, the proposed DFD-CRNN outperforms FDY-CRNN by 3.12% in terms of polyphonic sound detection score (PSDS).
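To make the core idea concrete, the following is a minimal single-channel sketch of how a frequency-varying combination of dilated basis kernels could be computed. It is an illustration under stated assumptions, not the implementation used for DFD-CRNN: the function names (`dilated_conv2d`, `softmax`, `dfd_conv`), the fixed 3x3 kernel size, the zero padding, and the softmax normalization of the attention weights are all hypothetical choices, and the attention logits are passed in directly rather than produced from the input by an attention module.

```python
import math


def dilated_conv2d(x, kernel, dilation):
    """Same-size 2D cross-correlation of x[f][t] with a 3x3 kernel.

    dilation is (d_freq, d_time); positions outside x are treated as zero.
    """
    n_freq, n_time = len(x), len(x[0])
    d_f, d_t = dilation
    out = [[0.0] * n_time for _ in range(n_freq)]
    for f in range(n_freq):
        for t in range(n_time):
            acc = 0.0
            for i in range(3):
                for j in range(3):
                    ff = f + (i - 1) * d_f  # dilated offset along frequency
                    tt = t + (j - 1) * d_t  # dilated offset along time
                    if 0 <= ff < n_freq and 0 <= tt < n_time:
                        acc += kernel[i][j] * x[ff][tt]
            out[f][t] = acc
    return out


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dfd_conv(x, basis_kernels, dilations, attention_logits):
    """Combine dilated basis-kernel outputs with per-frequency weights.

    attention_logits[k][f] is the (hypothetical, externally supplied) logit
    of basis kernel k at frequency bin f. Normalizing the logits over k with
    a softmax is an assumption of this sketch.
    """
    n_freq, n_time = len(x), len(x[0])
    n_basis = len(basis_kernels)
    # Each basis kernel is applied with its own dilation size.
    basis_out = [dilated_conv2d(x, w, d)
                 for w, d in zip(basis_kernels, dilations)]
    out = [[0.0] * n_time for _ in range(n_freq)]
    for f in range(n_freq):
        # The weights depend on the frequency bin, so the effective
        # kernel differs from one bin to the next.
        weights = softmax([attention_logits[k][f] for k in range(n_basis)])
        for t in range(n_time):
            out[f][t] = sum(weights[k] * basis_out[k][f][t]
                            for k in range(n_basis))
    return out


if __name__ == "__main__":
    # Toy 4 (frequency) x 5 (time) input and two basis kernels.
    x = [[float(f * 10 + t) for t in range(5)] for f in range(4)]
    identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
    vertical = [[0, 1, 0], [0, 0, 0], [0, 1, 0]]
    # Low bins favor the undilated kernel, high bins the dilated one.
    logits = [[2.0, 1.0, -1.0, -2.0], [-2.0, -1.0, 1.0, 2.0]]
    y = dfd_conv(x, [identity, vertical], [(1, 1), (2, 1)], logits)
    for row in y:
        print([round(v, 3) for v in row])
```

Because the convolution is linear in the kernel, weighting the outputs of the basis kernels per frequency bin is equivalent to applying, at each bin, a single kernel formed by the weighted sum of the dilated basis kernels. Giving the basis kernels different dilation sizes therefore both enlarges the receptive field and makes the basis kernels structurally distinct from one another.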