We propose EnCLAP, a novel framework for automated audio captioning. EnCLAP employs two acoustic representation models, EnCodec and CLAP, along with a pretrained language model, BART. We also introduce a new training objective, masked codec modeling, that improves the acoustic awareness of the pretrained language model. Experimental results on AudioCaps and Clotho demonstrate that our model surpasses the performance of baseline models. Source code will be available at https://github.com/jaeyeonkim99/EnCLAP . An online demo is available at https://huggingface.co/spaces/enclap-team/enclap .