Several meta-evaluation studies have examined the correlation between human ratings and offline machine translation (MT) evaluation metrics such as BLEU, chrF2, BERTScore and COMET. These metrics have also been used to evaluate simultaneous speech translation (SST), but their correlation with human ratings of SST, which have recently been collected as Continuous Ratings (CR), is unclear. In this paper, we leverage the evaluations of candidate systems submitted to the English-German SST task at IWSLT 2022 and conduct an extensive correlation analysis of CR and the aforementioned metrics. Our study reveals that the offline metrics are well correlated with CR and can be reliably used for evaluating machine translation in simultaneous mode, with some limitations on the test set size. We conclude that, given the current quality levels of SST, these metrics can be used as proxies for CR, alleviating the need for large-scale human evaluation. Additionally, we observe that the metrics correlate significantly more strongly with CR when a translation is used as the reference than when simultaneous interpreting is used, and thus we recommend the former for reliable evaluation.
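As a rough illustration of what a correlation analysis between an automatic metric and CR involves, the sketch below computes a Pearson correlation coefficient over paired scores. It is a minimal sketch under stated assumptions: the abstract does not say which correlation coefficient or granularity the study uses, so Pearson over per-system scores is an assumption here, and the function name `pearson` and all numbers are hypothetical placeholders, not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical placeholder data: one automatic metric score and one mean
# Continuous Rating per system. These are NOT values from the paper.
metric_scores = [18.2, 21.5, 24.9, 27.3, 30.1]
cr_scores = [2.1, 2.4, 2.9, 3.0, 3.6]

r = pearson(metric_scores, cr_scores)
print(f"Pearson r between metric and CR: {r:.3f}")
```

A coefficient close to 1 would indicate that the metric ranks and spaces the systems much as the human raters do. With only a handful of paired scores the estimate is unstable, which is in line with the caveat about test set size noted above.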