Convolutional neural networks (CNNs) are widely used in sound event detection (SED). However, conventional convolution has limited ability to learn time-frequency representations of different sound events. To address this issue, we propose multi-dimensional frequency dynamic convolution (MFDConv), a new design that endows convolutional kernels with frequency-adaptive dynamic properties along multiple dimensions. MFDConv uses a novel multi-dimensional attention mechanism with a parallel strategy to learn complementary frequency-adaptive attentions, which substantially strengthen the feature extraction ability of the convolutional kernels. Moreover, to improve the performance of the mean teacher, we propose the confident mean teacher, which increases the accuracy of the teacher's pseudo-labels and trains the student only with high-confidence labels. Experimental results show that the proposed methods achieve a PSDS1 of 0.470 and a PSDS2 of 0.692 on the DESED real validation dataset.
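To make the idea of training the student with high-confidence labels concrete, the following is a minimal sketch, not the method proposed here. It assumes a simple thresholding rule: teacher probabilities above `hi` or below `lo` become hard pseudo-labels, and all other entries are masked out of the consistency loss. The function names, the thresholds `hi` and `lo`, and the masked binary cross-entropy form are illustrative assumptions; only the exponential moving average (EMA) teacher update is the standard mean teacher rule.

```python
import math

def confident_pseudo_labels(teacher_probs, hi=0.9, lo=0.1):
    """Turn teacher probabilities into hard labels plus a confidence mask.

    hi and lo are hypothetical thresholds chosen for illustration only.
    """
    labels, mask = [], []
    for p in teacher_probs:
        if p >= hi:            # confidently active
            labels.append(1.0)
            mask.append(1.0)
        elif p <= lo:          # confidently inactive
            labels.append(0.0)
            mask.append(1.0)
        else:                  # uncertain: excluded from the loss
            labels.append(0.0)
            mask.append(0.0)
    return labels, mask

def masked_bce(student_probs, labels, mask, eps=1e-7):
    """Binary cross-entropy averaged over confident entries only."""
    total, count = 0.0, 0.0
    for q, y, m in zip(student_probs, labels, mask):
        q = min(max(q, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += m * -(y * math.log(q) + (1.0 - y) * math.log(1.0 - q))
        count += m
    return total / count if count > 0 else 0.0

def ema_update(teacher_w, student_w, alpha=0.999):
    """Standard mean teacher update: teacher weights track an EMA of the student."""
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher_w, student_w)]

if __name__ == "__main__":
    labels, mask = confident_pseudo_labels([0.95, 0.5, 0.05])
    print(labels, mask)
    print(masked_bce([0.9, 0.2, 0.1], labels, mask))
```

In this sketch the uncertain middle entry contributes nothing to the loss, so noisy teacher predictions do not propagate to the student.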