Automated audio captioning (AAC) is an audio-to-text task that describes audio content in natural language. Recent advances in large language models (LLMs), together with improved training approaches for audio encoders, have opened up new possibilities for improving AAC. We therefore explore enhancing AAC from three aspects: 1) an audio encoder pre-trained via consistent ensemble distillation (CED) is used to improve the effectiveness of acoustic tokens, with a querying transformer (Q-Former) bridging the modality gap to the LLM and compressing the acoustic tokens; 2) we investigate the advantages of using Llama 2 with 7B parameters as the decoder; 3) another pre-trained LLM corrects text errors caused by insufficient training data and annotation ambiguities. Both the audio encoder and the text decoder are optimized with low-rank adaptation (LoRA). Experiments show that each of these enhancements is effective. Our method achieves a 33.0 SPIDEr-FL score, outperforming the winner of DCASE 2023 Task 6A.
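The Q-Former in aspect 1) compresses a variable-length acoustic token sequence into a fixed number of vectors by letting a small set of learned queries cross-attend over the tokens. The sketch below is a minimal single-head illustration of that compression idea in pure Python; the query and token values are illustrative, and learned projections, multi-head attention, and feed-forward layers are omitted.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def qformer_compress(queries, tokens):
    """Single-head cross-attention sketch: each learned query attends
    over all acoustic tokens, so the output has len(queries) vectors
    regardless of how many tokens the encoder produced."""
    d = len(queries[0])
    out = []
    for q in queries:
        # Scaled dot-product scores between this query and every token.
        scores = [sum(qi * ti for qi, ti in zip(q, t)) / math.sqrt(d)
                  for t in tokens]
        weights = softmax(scores)
        # Weighted sum of tokens: one output vector per query.
        out.append([sum(w * t[j] for w, t in zip(weights, tokens))
                    for j in range(d)])
    return out

# 2 learned queries compress 5 acoustic tokens into 2 vectors.
queries = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[0.5, 0.2], [0.1, 0.9], [0.3, 0.3], [0.7, 0.1], [0.2, 0.6]]
compressed = qformer_compress(queries, tokens)
print(len(compressed))  # number of outputs equals number of queries
```

Because the output length depends only on the number of queries, the LLM decoder sees a short, fixed-size acoustic prefix even for long audio clips.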
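LoRA, used above to optimize both the audio encoder and the text decoder, freezes a pre-trained weight matrix W and learns only a low-rank update B·A scaled by alpha/r. A minimal sketch of how the merged weight is formed, assuming illustrative matrix sizes and scaling (real adapters are attached to transformer projection matrices):

```python
def matmul(X, Y):
    # Plain-Python matrix product of two lists of rows.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_merge(W, A, B, alpha):
    """Effective weight W' = W + (alpha / r) * (B @ A), where
    r = rank of the adapter (number of rows of A). W stays frozen;
    only the small matrices A and B are trained."""
    r = len(A)
    BA = matmul(B, A)
    return [[w + (alpha / r) * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]

# Rank-1 update of a 2x2 frozen weight.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [0.0]]      # d_out x r
A = [[0.0, 2.0]]        # r x d_in
print(lora_merge(W, A, B, alpha=2.0))
```

With rank r much smaller than the weight dimensions, the trainable parameter count drops from d_out·d_in to r·(d_out + d_in), which is what makes tuning a 7B-parameter decoder practical.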