This report presents the systems developed and submitted by Fortemedia Singapore (FMSG) and the Joint Laboratory of Environmental Sound Sensing (JLESS) for DCASE 2024 Task 4. The task focuses on recognizing event classes and their time boundaries, given that multiple events can be present and may overlap in an audio recording. The novelty this year is a dataset drawn from two sources, which makes it challenging to achieve good performance when the source of an audio clip is unknown during evaluation. To address this, we propose a sound event detection method based on domain generalization. Our approach integrates features from bidirectional encoder representations from audio transformers (BEATs) and a convolutional recurrent neural network (CRNN). We focus on three main strategies to improve our method. First, we apply MixStyle along the frequency dimension to adapt the mel-spectrograms from different domains. Second, we compute the training loss separately for each dataset, restricted to its corresponding classes. This independent learning framework helps the model extract domain-specific features effectively. Lastly, we use the sound event bounding boxes method for post-processing. Our proposed method shows superior macro-average pAUC and polyphonic SED score performance on the DCASE 2024 Challenge Task 4 validation dataset and public evaluation dataset.
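The first strategy, MixStyle along the frequency dimension, can be illustrated with a minimal NumPy sketch: per-frequency statistics (mean and standard deviation over time) of each mel-spectrogram are mixed with those of a randomly permuted batch, so the model sees interpolated domain styles during training. The function name, Beta-distribution parameter, and tensor layout below are illustrative assumptions; the submitted system's exact implementation may differ.

```python
import numpy as np

def freq_mixstyle(x, alpha=0.3, eps=1e-6, rng=None):
    """Frequency-wise MixStyle on a batch of log-mel spectrograms.

    x: array of shape (batch, freq, time).
    Per-frequency mean/std are computed over the time axis, then
    mixed with the statistics of a shuffled batch using a
    Beta(alpha, alpha)-distributed mixing weight.
    """
    rng = rng or np.random.default_rng()
    batch = x.shape[0]
    lam = rng.beta(alpha, alpha, size=(batch, 1, 1))   # per-sample mixing weight
    mu = x.mean(axis=2, keepdims=True)                 # (batch, freq, 1)
    sig = x.std(axis=2, keepdims=True) + eps
    x_norm = (x - mu) / sig                            # remove own style
    perm = rng.permutation(batch)                      # pair each sample with another
    mu_mix = lam * mu + (1 - lam) * mu[perm]
    sig_mix = lam * sig + (1 - lam) * sig[perm]
    return x_norm * sig_mix + mu_mix                   # re-style with mixed stats
```

Applied only at training time, this leaves the spectrogram shape unchanged while perturbing the domain-dependent frequency statistics.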