Data-Free Knowledge Distillation (DFKD) has recently attracted growing attention in the academic community, especially after major breakthroughs in computer vision. Despite these promising results, the technique has not yet been widely applied to audio and signal processing, where the variable duration of audio signals calls for its own modeling approach. In this work, we propose feature-rich audio model inversion (FRAMI), a data-free knowledge distillation framework for general sound classification tasks. FRAMI first generates high-quality, feature-rich Mel-spectrograms through a feature-invariant contrastive loss. It then reuses the hidden states before and after the statistics pooling layer when performing knowledge distillation on these feature-rich samples. Experimental results on the Urbansound8k, ESC-50, and audioMNIST datasets demonstrate that FRAMI can generate feature-rich samples. Moreover, reusing the hidden states further improves the accuracy of the student model, which significantly outperforms the baseline method.
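To make the role of the statistics pooling layer concrete, the following is a minimal sketch and not the FRAMI implementation. It assumes the common definition of statistics pooling (per-dimension mean and standard deviation over time) and the standard temperature-softened KL distillation loss. The function names, the temperature value, and in particular `hidden_state_loss` (a plain mean-squared error between teacher and student states before and after pooling, with equal hidden sizes) are hypothetical illustrations of what "reusing" the hidden states could look like; the abstract does not specify the actual mechanism.

```python
import math
import random


def statistics_pooling(frames):
    """Map T frame vectors (each of length D) to one vector of length 2*D."""
    num_frames = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / num_frames for d in range(dims)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / num_frames)
        for d in range(dims)
    ]
    return means + stds  # concatenation of mean and standard deviation


def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened class distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))
    return kl * temperature ** 2


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def hidden_state_loss(student_frames, teacher_frames):
    """ASSUMPTION: match hidden states before and after pooling with MSE.

    This is only one plausible reading of "reusing" the hidden states and
    assumes teacher and student share the same hidden dimension.
    """
    pre = sum(mse(s, t) for s, t in zip(student_frames, teacher_frames))
    pre /= len(teacher_frames)
    post = mse(statistics_pooling(student_frames),
               statistics_pooling(teacher_frames))
    return pre + post


# Two synthetic "clips" of different duration (50 vs. 300 frames, 8 dims each).
random.seed(0)
short_clip = [[random.gauss(0, 1) for _ in range(8)] for _ in range(50)]
long_clip = [[random.gauss(0, 1) for _ in range(8)] for _ in range(300)]

# Both clips pool to the same fixed size regardless of duration.
print(len(statistics_pooling(short_clip)), len(statistics_pooling(long_clip)))
```

Because pooling removes the time axis, the states after pooling have a fixed size for any clip length, while the states before pooling keep one vector per frame; a distillation objective could weight the two terms differently.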