Music captioning has emerged as a promising task, fueled by the advent of advanced language generation models. However, its evaluation relies heavily on traditional metrics such as BLEU, METEOR, and ROUGE, which were developed for other domains and have been adopted in this new field without proper justification. We present cases in which these metrics are vulnerable to syntactic changes and show that they do not correlate well with human judgments. By highlighting these issues, we aim to emphasize the need for a critical reevaluation of how music captions are assessed.
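To make the vulnerability to syntactic changes concrete, the following minimal sketch computes clipped n-gram precision, the overlap term at the core of BLEU, for two candidate captions. It is an illustration only, not the evaluation protocol of this work: the captions are invented for the example, the helper names `ngrams` and `clipped_precision` are hypothetical, and the brevity penalty and multi-order averaging of full BLEU are deliberately omitted.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against one reference.

    Simplified: whitespace tokenization, lowercasing, a single reference,
    and no brevity penalty (all assumptions made for this sketch).
    """
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs
    # in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


# Invented example captions (not taken from any dataset).
reference = "soft strings accompany a slow piano melody"
paraphrase = "a slow piano melody is accompanied by soft strings"  # same meaning, passive voice
mismatch = "loud drums accompany a slow piano melody"  # different musical content

for name, caption in [("paraphrase", paraphrase), ("mismatch", mismatch)]:
    p1 = clipped_precision(caption, reference, 1)
    p2 = clipped_precision(caption, reference, 2)
    print(f"{name}: unigram={p1:.3f} bigram={p2:.3f}")
```

In this toy case the passive-voice paraphrase, which describes the same music as the reference, obtains a bigram precision of 0.5, while the caption that names different instruments and dynamics obtains roughly 0.67. Surface overlap thus rewards the factually different caption over the faithful one.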