Existing metrics for evaluating audio captions report only a score, with little indication of what is wrong when the score is low. Finding the shortcomings of a caption still requires manual inspection. In this work, we introduce a metric that automatically identifies the shortcomings of an audio caption by detecting the misses and false alarms in a candidate caption with respect to a reference caption, and that reports recall, precision, and F-score. Such a metric is useful for profiling the deficiencies of an audio captioning model, which is a step towards improving the quality of audio captions.
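To make the relationship between misses, false alarms, and the reported scores concrete, the following is a minimal sketch, not the method of this work. It assumes that the sound events mentioned in each caption have already been extracted as short phrases and that two phrases match only if they are identical strings; both the extraction step and the matching rule are simplifying assumptions, and the names `CaptionProfile` and `profile_caption` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CaptionProfile:
    """Hypothetical container for the per-caption diagnosis."""
    misses: set          # events in the reference but absent from the candidate
    false_alarms: set    # events in the candidate but absent from the reference
    recall: float
    precision: float
    f_score: float


def profile_caption(reference_events, candidate_events):
    """Compare two collections of event phrases using exact string matching.

    Assumption: event phrases were extracted from the captions beforehand.
    A real system would likely need a softer notion of matching.
    """
    reference = set(reference_events)
    candidate = set(candidate_events)

    matched = reference & candidate
    misses = reference - candidate
    false_alarms = candidate - reference

    # Guard against empty captions to avoid division by zero.
    recall = len(matched) / len(reference) if reference else 0.0
    precision = len(matched) / len(candidate) if candidate else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0

    return CaptionProfile(misses, false_alarms, recall, precision, f_score)


if __name__ == "__main__":
    profile = profile_caption(
        reference_events=["dog barking", "car passing", "rain"],
        candidate_events=["dog barking", "people talking"],
    )
    print("misses:", sorted(profile.misses))
    print("false alarms:", sorted(profile.false_alarms))
    print("recall:", profile.recall)
    print("precision:", profile.precision)
    print("F-score:", profile.f_score)
```

In this toy example the candidate matches one of three reference events and one of its own two events, so the lists of misses and false alarms explain why recall and precision are low, which a single scalar score cannot do.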