This report proposes a sound event detection (SED) model for DCASE 2023 Task 4 that combines a frequency dynamic convolution (FDY) and large kernel attention (LKA) convolutional recurrent neural network (CRNN) with embeddings from a pre-trained bidirectional encoder representation from audio transformers (BEATs) model. To address the limited amount of labeled data, the model is trained with a mean-teacher and pseudo-label approach. The proposed FDY with LKA integrates the FDY and LKA modules to capture time-frequency patterns, long-term dependencies, and high-level semantic information in audio signals. The FDY with LKA-CRNN with BEATs embedding network is first trained on the entire DCASE 2023 Task 4 dataset using the mean-teacher approach, and the trained network then generates pseudo-labels for the weakly labeled data, the unlabeled data, and AudioSet. Subsequently, the SED model is retrained using the same pseudo-label approach. A subset of the resulting models is selected for submission, demonstrating strong F1-scores and polyphonic sound detection scores (PSDS) on the DCASE 2023 Challenge Task 4 validation dataset.
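The two semi-supervised ingredients named above can be illustrated with a minimal sketch. It is not the report's implementation: the function names `ema_update` and `pseudo_label`, the decay `alpha`, the 0.5 threshold, and the use of hard (binarized) pseudo-labels are all assumptions made for illustration, and model weights are reduced to a plain dictionary of floats.

```python
from typing import Dict, List


def ema_update(teacher: Dict[str, float],
               student: Dict[str, float],
               alpha: float = 0.999) -> Dict[str, float]:
    """Mean-teacher step: move each teacher weight toward the student weight
    with an exponential moving average (alpha is an assumed decay value)."""
    return {name: alpha * teacher[name] + (1.0 - alpha) * student[name]
            for name in teacher}


def pseudo_label(frame_probs: List[List[float]],
                 threshold: float = 0.5) -> List[List[int]]:
    """Turn per-frame, per-class probabilities into hard pseudo-labels
    by thresholding (the threshold value is an assumption)."""
    return [[1 if p >= threshold else 0 for p in frame]
            for frame in frame_probs]


# Hypothetical usage: one scalar weight and a clip with two frames, two classes.
teacher = ema_update({"w": 1.0}, {"w": 0.0}, alpha=0.9)
labels = pseudo_label([[0.2, 0.7], [0.5, 0.1]])
```

In an actual training loop the student would be updated by gradient descent on the labeled and consistency losses, the teacher would be refreshed with `ema_update` after each step, and the trained model's predictions on weakly labeled, unlabeled, and AudioSet clips would be passed through something like `pseudo_label` before retraining.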