Convolutional neural networks (CNNs) are widely used in sound event detection (SED). However, traditional convolution is limited in its ability to learn time-frequency representations of different sound events. To address this issue, we propose multi-dimensional frequency dynamic convolution (MFDConv), a new design that endows convolutional kernels with frequency-adaptive dynamic properties along multiple dimensions. MFDConv uses a novel multi-dimensional attention mechanism with a parallel strategy to learn complementary frequency-adaptive attentions, which substantially strengthen the feature extraction ability of the convolutional kernels. Moreover, to improve the performance of the mean teacher method, we propose the confident mean teacher, which increases the accuracy of the teacher's pseudo-labels and trains the student with high-confidence labels. Experimental results show that the proposed methods achieve a PSDS1 of 0.470 and a PSDS2 of 0.692 on the DESED real validation dataset.
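The idea of training the student only on high-confidence teacher labels can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `confident_pseudo_label_loss`, the `threshold` value of 0.9, the hard 0/1 pseudo-labels, and the choice of a masked binary cross-entropy are all assumptions made for illustration.

```python
import math


def confident_pseudo_label_loss(teacher_probs, student_probs, threshold=0.9):
    """Masked binary cross-entropy against hard teacher pseudo-labels.

    teacher_probs, student_probs: flat lists of per-frame, per-class
    probabilities in [0, 1]. `threshold` is a hypothetical confidence
    cutoff, not a value taken from the paper.
    Returns (mean loss over kept entries, number of kept entries).
    """
    eps = 1e-7  # keeps log() away from 0
    total, kept = 0.0, 0
    for t, s in zip(teacher_probs, student_probs):
        # Confidence is the distance of the teacher output from 0.5,
        # expressed as the probability of its more likely decision.
        confidence = max(t, 1.0 - t)
        if confidence < threshold:
            continue  # assumed rule: ignore low-confidence pseudo-labels
        label = 1.0 if t >= 0.5 else 0.0  # hard pseudo-label
        s = min(max(s, eps), 1.0 - eps)
        total += -(label * math.log(s) + (1.0 - label) * math.log(1.0 - s))
        kept += 1
    return (total / kept if kept else 0.0), kept


# Usage: the middle entry (teacher output 0.6) is below the cutoff and is skipped.
loss, kept = confident_pseudo_label_loss([0.95, 0.6, 0.05], [0.9, 0.5, 0.1])
```

Under these assumptions, only entries where the teacher is decisive contribute to the student's loss, so uncertain teacher outputs do not propagate as noisy targets.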