In sound event detection (SED), convolutional neural networks (CNNs) are widely used to extract time-frequency (TF) patterns from spectrograms. However, because convolution is translation-equivariant, CNNs are insensitive to shifts of TF patterns along the frequency dimension, which limits their ability to distinguish different sound events. Frequency dynamic convolution (FDY) was proposed to address this issue by applying frequency-specific convolution kernels to different frequency components, but it requires significantly more parameters and computation than a standard CNN. This paper proposes a more efficient solution called frequency-aware convolution (FAC). FAC encodes frequency positional information in a vector and explicitly adds that vector to the input spectrogram. To match the amplitude of the encoding vector to that of the input spectrogram, the vector is scaled adaptively and per channel using self-attention. We evaluate FAC on DCASE 2023 Task 4. The results show that FAC achieves performance comparable to FDY while adding only 515 parameters, whereas FDY adds 8.02 million. Furthermore, an ablation study confirms that the adaptive, channel-dependent scaling of the encoding vector is critical to the performance of FAC.
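The abstract does not specify the form of the encoding vector or of the self-attention used for scaling, so the following NumPy sketch is only an illustration of the general idea under stated assumptions: a linear ramp stands in for the frequency encoding, and a simple query-key dot product stands in for the attention-based scaling. The function names, the projection matrices `w_q` and `w_k`, and the tensor layout `(channels, frames, frequency bins)` are all hypothetical and are not taken from the paper.

```python
import numpy as np


def frequency_encoding(n_freq: int) -> np.ndarray:
    """Return one encoding value per frequency bin, shape (F,).

    Assumption: a linear ramp in [-1, 1]; the abstract does not give the
    actual encoding used by FAC.
    """
    return np.linspace(-1.0, 1.0, n_freq)


def channel_scales(x: np.ndarray, enc: np.ndarray,
                   w_q: np.ndarray, w_k: np.ndarray) -> np.ndarray:
    """Return one input-dependent scale per channel, shape (C,).

    Assumption: an attention-style score between a per-channel summary of
    the input (query) and the encoding vector (key). This is a stand-in,
    not the paper's formulation.
    """
    summary = x.mean(axis=1)                 # (C, F): average over time frames
    query = summary @ w_q                    # (C, d)
    key = enc @ w_k                          # (d,)
    return query @ key / np.sqrt(w_q.shape[1])


def add_frequency_encoding(x: np.ndarray,
                           w_q: np.ndarray, w_k: np.ndarray) -> np.ndarray:
    """Add the scaled frequency encoding to a (C, T, F) spectrogram."""
    enc = frequency_encoding(x.shape[-1])
    scales = channel_scales(x, enc, w_q, w_k)
    # Broadcast: the same frequency profile is added to every time frame,
    # with a different amplitude for each channel.
    return x + scales[:, None, None] * enc[None, None, :]


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 100, 128))       # (channels, frames, mel bins)
w_q = rng.standard_normal((128, 4)) * 0.1    # hypothetical query projection
w_k = rng.standard_normal((128, 4)) * 0.1    # hypothetical key projection
y = add_frequency_encoding(x, w_q, w_k)
print(y.shape)  # (2, 100, 128)
```

Because the added term depends on the absolute frequency index, a subsequent standard convolution can tell apart identical TF patterns at different frequencies, which is the effect the abstract attributes to FAC. In this sketch the encoding itself has no learnable parameters and only the two small projections would be trained; the paper's own count of 515 additional parameters comes from its actual design, which this sketch does not attempt to reproduce.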