As research on machine translation moves to translating text beyond the sentence level, it remains unclear how effective automatic evaluation metrics are at scoring longer translations. In this work, we first propose a method for creating paragraph-level data for training and meta-evaluating metrics from existing sentence-level data. Then, we use these new datasets to benchmark existing sentence-level metrics and to train learned metrics at the paragraph level. Interestingly, our experimental results demonstrate that using sentence-level metrics to score entire paragraphs is as effective as using a metric designed to work at the paragraph level. We speculate that this result can be attributed to properties of the reference-based evaluation task, as well as to the limitations of our datasets in capturing all types of phenomena that occur in paragraph-level translations.