Automated audio captioning (AAC) has improved significantly with recent models. However, these models have grown increasingly large as their performance improves. In this work, we propose a knowledge distillation (KD) framework for AAC. Our analysis shows that in encoder-decoder AAC models, distilling knowledge into the encoder is more effective than distilling it into the decoder. To this end, we incorporate an encoder-level KD loss into training, in addition to the standard supervised loss and the sequence-level KD loss. We investigate two encoder-level KD methods, based on mean squared error (MSE) loss and contrastive loss, respectively. Experimental results demonstrate that contrastive KD is more robust than MSE KD, exhibiting superior performance in data-scarce settings. By leveraging audio-only data during KD training, our student model achieves competitive performance with an inference speed that is 19 times faster\footnote{An online demo is available at \url{https://huggingface.co/spaces/wsntxxn/efficient_audio_captioning}}.
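The two encoder-level KD objectives can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: it assumes clip-level encoder embeddings of equal dimension for student and teacher, an InfoNCE-style formulation for the contrastive loss, and an illustrative temperature `tau`; all function names are hypothetical.

```python
import numpy as np

def mse_kd_loss(student, teacher):
    # MSE KD: match student encoder embeddings to teacher embeddings element-wise
    return np.mean((student - teacher) ** 2)

def contrastive_kd_loss(student, teacher, tau=0.07):
    # Contrastive KD (InfoNCE-style sketch): each student embedding should be
    # most similar to the teacher embedding of the same clip, relative to the
    # teacher embeddings of other clips in the batch.
    s = student / np.linalg.norm(student, axis=1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=1, keepdims=True)
    logits = s @ t.T / tau                        # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # positives on the diagonal

# Toy batch of 4 clips with 8-dimensional embeddings
rng = np.random.default_rng(0)
stu = rng.normal(size=(4, 8))
tea = rng.normal(size=(4, 8))
mse = mse_kd_loss(stu, tea)
nce = contrastive_kd_loss(stu, tea)
```

Because the contrastive objective only constrains relative similarities within a batch, it tolerates differences in embedding scale between student and teacher, which is one plausible reason it is more robust than MSE KD when paired data is scarce.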