The Automated Audio Captioning (AAC) task aims to describe an audio signal using natural language. To evaluate machine-generated captions, metrics should take into account audio events, acoustic scenes, paralinguistics, signal characteristics, and other audio information. Traditional AAC evaluation relies on natural language generation metrics such as ROUGE and BLEU, image captioning metrics such as SPICE and CIDEr, or Sentence-BERT embedding similarity. However, these metrics compare generated captions only to human references, overlooking the audio signal itself. In this work, we propose MACE (Multimodal Audio-Caption Evaluation), a novel metric that integrates both the audio and the reference captions for comprehensive audio caption evaluation. MACE extracts information from the audio as well as from the predicted and reference captions, and weights the resulting score with a fluency penalty. Our experiments demonstrate MACE's superior performance in predicting human quality judgments compared to traditional metrics. Specifically, MACE achieves relative accuracy improvements of 3.28% and 4.36% over the FENSE metric on the AudioCaps-Eval and Clotho-Eval datasets, respectively. Moreover, it significantly outperforms all previous metrics on the audio captioning evaluation task. The metric is open-sourced at https://github.com/satvik-dixit/mace
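The idea of combining audio-to-caption similarity, caption-to-caption similarity, and a fluency penalty can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the function name `mace_like_score`, the mixing weight `alpha`, the fluency threshold, and the penalty factor are all assumptions for demonstration, and the embeddings are passed in as plain vectors rather than computed by a real audio-language model such as CLAP.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mace_like_score(audio_emb, cand_emb, ref_embs,
                    fluency_error_prob, alpha=0.5,
                    error_threshold=0.9, penalty=0.9):
    """Illustrative MACE-style score (parameters are assumed, not from the paper).

    audio_emb: embedding of the audio clip
    cand_emb: embedding of the candidate (predicted) caption
    ref_embs: list of embeddings of human reference captions
    fluency_error_prob: probability (from a fluency classifier) that the
        candidate caption is disfluent
    """
    # Audio-text similarity: compare the candidate caption to the audio itself.
    audio_sim = cosine(audio_emb, cand_emb)
    # Text-text similarity: best match against any human reference caption.
    text_sim = max(cosine(cand_emb, r) for r in ref_embs)
    # Mix the two similarities (alpha is an illustrative weight).
    score = alpha * audio_sim + (1.0 - alpha) * text_sim
    # Apply a fluency penalty when the caption is judged disfluent,
    # in the spirit of FENSE's error-detector gating.
    if fluency_error_prob > error_threshold:
        score *= (1.0 - penalty)
    return score
```

A disfluent caption (high `fluency_error_prob`) is scored far lower than the same caption judged fluent, which mirrors how the fluency penalty down-weights otherwise-similar but ill-formed captions. The actual scoring details are in the open-source repository linked above.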