In sound event detection (SED), convolutional neural networks (CNNs) are widely used to extract time-frequency patterns from the input spectrogram. However, because the same kernels are applied at every position, features extracted by CNNs can be insensitive to shifts of time-frequency patterns along the frequency axis. To address this issue, frequency dynamic convolution (FDY) has been proposed, which applies different kernels to different frequency components. Compared to a vanilla CNN, FDY requires several times more parameters. In this paper, a more efficient solution named frequency-aware convolution (FAC) is proposed. In FAC, frequency-positional information is encoded in a vector and added to the input spectrogram. To match the amplitude of the input, the encoding vector is scaled adaptively and channel-independently. Experiments are carried out in the context of DCASE 2022 Task 4, and the results demonstrate that FAC achieves performance comparable to that of FDY with only 515 additional parameters, whereas FDY requires 8.02 million additional parameters. The ablation study shows that scaling the encoding vector adaptively and channel-independently is critical to the performance of FAC.
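The core idea of FAC, adding a scaled frequency-positional vector to the input of a convolution, can be illustrated with a minimal NumPy sketch. The abstract does not specify how the encoding vector is built or how the adaptive scale is computed, so everything below is an assumption made for illustration: the function name `frequency_aware_input`, the linearly spaced `encoding`, and the choice of a per-channel scale equal to a `gain` times the channel's RMS amplitude are all hypothetical and are not the authors' formulation.

```python
import numpy as np

def frequency_aware_input(x, encoding, gain):
    """Add a frequency-positional encoding to a feature map (illustrative only).

    x:        (channels, freq_bins, frames) feature map
    encoding: (freq_bins,) frequency-positional vector (hypothetical values)
    gain:     (channels,) per-channel gain (hypothetical parameterisation)
    """
    # Adaptive part (assumption): the scale follows the amplitude of the input,
    # computed separately for each channel.
    rms = np.sqrt(np.mean(x ** 2, axis=(1, 2)))   # shape (channels,)
    scale = gain * rms                            # shape (channels,)
    # Broadcast so the offset varies with channel and frequency bin,
    # but is identical for every frame.
    return x + scale[:, None, None] * encoding[None, :, None]

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 5))    # 2 channels, 8 frequency bins, 5 frames
x[1] *= 10.0                          # make the second channel much louder
encoding = np.linspace(-1.0, 1.0, 8)  # assumed encoding: one value per bin
gain = np.ones(2)                     # assumed per-channel gains

y = frequency_aware_input(x, encoding, gain)
offset = y - x                        # what was added to the input
```

In this sketch the offset depends on the frequency bin, so a convolution with shared kernels applied to `y` can distinguish identical patterns located at different frequencies. Because the scale is computed per channel, the louder channel receives a proportionally larger offset, which is one plausible reading of matching the amplitude of the input.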