Sound event detection (SED) is essential for recognizing specific sounds and their temporal locations within acoustic signals. It is particularly challenging for on-device applications, where computational resources are limited. To address this issue, we introduce a novel framework, referred to as dual knowledge distillation, for developing efficient SED systems. Dual knowledge distillation begins with temporal-averaging knowledge distillation (TAKD), which uses a mean student model derived from the temporal averaging of the student model's parameters. This allows the student model to learn indirectly from a pre-trained teacher model, ensuring stable knowledge distillation. Next, we introduce embedding-enhanced feature distillation (EEFD), which incorporates an embedding distillation layer within the student model to strengthen contextual learning. On the DCASE 2023 Task 4A public evaluation dataset, our proposed SED system with dual knowledge distillation, which has only one-third of the baseline model's parameters, demonstrates superior performance in terms of PSDS1 and PSDS2. This highlights the importance of the proposed dual knowledge distillation for compact SED systems, which are well suited to edge devices.
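The temporal averaging behind the mean student model can be illustrated with a minimal sketch. The abstract does not specify the averaging rule, so the code below assumes an exponential moving average (EMA) of the student's parameters, as is common in mean-teacher style training. The function name `update_mean_student`, the `decay` value, and the dictionary-of-lists parameter layout are hypothetical choices for illustration, not the authors' implementation.

```python
from typing import Dict, List

# Hypothetical parameter container: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def update_mean_student(mean_student: Params, student: Params, decay: float = 0.999) -> None:
    """Update the mean student in place with an EMA of the student's weights.

    Assumed rule (not stated in the abstract):
        mean <- decay * mean + (1 - decay) * student
    """
    for name, weights in student.items():
        averaged = mean_student[name]
        for i, w in enumerate(weights):
            averaged[i] = decay * averaged[i] + (1.0 - decay) * w


# Toy example: one training step's worth of averaging.
student = {"layer.weight": [1.0, 2.0]}
mean_student = {"layer.weight": [0.0, 0.0]}
update_mean_student(mean_student, student, decay=0.9)
```

In a training loop, such an update would typically be called once after each optimizer step on the student, so that the mean student changes more smoothly than the student itself.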