Knowledge distillation has proven effective in a range of audio tasks, but its application to environmental sound classification often overlooks the low-level audio texture features needed to capture local patterns in complex acoustic environments. To address this gap, the Structural and Statistical Audio Texture Knowledge Distillation (SSATKD) framework is proposed, which combines high-level contextual information with low-level structural and statistical audio textures extracted from intermediate layers. To evaluate its generalizability, SSATKD is tested on four diverse environmental sound classification datasets: two passive sonar datasets, DeepShip and Vessel Type Underwater Acoustic Data (VTUAD), and two general environmental sound datasets, Environmental Sound Classification 50 (ESC-50) and UrbanSound8K. Two teacher adaptation strategies are explored: classifier-head-only adaptation and full fine-tuning. The framework is further evaluated with several convolutional and transformer-based teacher models. Experimental results show consistent accuracy improvements across all datasets and settings, supporting the effectiveness and robustness of SSATKD in real-world sound classification tasks.
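For readers unfamiliar with the baseline that SSATKD builds on, the following is a minimal sketch of generic soft-target knowledge distillation, in which a student is trained against both the ground-truth label and the teacher's temperature-softened output distribution. It is not the SSATKD objective: the abstract does not specify how the structural and statistical texture terms are computed, so they are omitted here. The function names, the temperature of 4.0, and the weighting `alpha` of 0.5 are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """Generic soft-target distillation loss (hypothetical helper).

    alpha weights the hard-label cross-entropy; (1 - alpha) weights the
    KL term between softened teacher and student distributions.
    Both default values are assumptions, not values from the paper.
    """
    # Hard-label cross-entropy at temperature 1.
    student_probs = softmax(student_logits)
    cross_entropy = -math.log(student_probs[label] + 1e-12)

    # Soft-target term, scaled by T^2 to keep gradient magnitudes comparable.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    soft_term = temperature ** 2 * kl_divergence(teacher_soft, student_soft)

    return alpha * cross_entropy + (1.0 - alpha) * soft_term
```

When the student's logits match the teacher's, the soft-target term vanishes and only the cross-entropy term remains. A texture-aware method such as SSATKD would add further terms computed from intermediate-layer features on top of a loss of this kind.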