As research on machine translation moves to translating text beyond the sentence level, it remains unclear how effective automatic evaluation metrics are at scoring longer translations. In this work, we first propose a method for creating paragraph-level data for training and meta-evaluating metrics from existing sentence-level data. Then, we use these new datasets to benchmark existing sentence-level metrics and to train learned metrics at the paragraph level. Interestingly, our experimental results demonstrate that using sentence-level metrics to score entire paragraphs is as effective as using a metric designed to work at the paragraph level. We speculate that this result can be attributed to properties of the reference-based evaluation task as well as to limitations of our datasets in capturing all types of phenomena that occur in paragraph-level translations.
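To make the idea of deriving paragraph-level data from sentence-level data more concrete, here is a minimal sketch. The abstract does not specify how sentences are grouped or how their human scores are combined, so everything below is an assumption made for illustration: the `Segment` record, the `build_paragraphs` helper, the fixed group size, and the choice of averaging sentence scores are all hypothetical and should not be read as the authors' procedure.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class Segment:
    """One scored translation unit (hypothetical record layout)."""
    source: str
    hypothesis: str
    reference: str
    score: float  # human quality judgment for this unit


def build_paragraphs(segments: List[Segment], size: int = 3) -> List[Segment]:
    """Group consecutive sentence-level segments into paragraph-level ones.

    Assumptions (not taken from the paper): segments are already in document
    order, groups have a fixed maximum size, text is joined with a space, and
    the paragraph score is the mean of its sentence scores.
    """
    paragraphs = []
    for start in range(0, len(segments), size):
        chunk = segments[start:start + size]
        paragraphs.append(
            Segment(
                source=" ".join(s.source for s in chunk),
                hypothesis=" ".join(s.hypothesis for s in chunk),
                reference=" ".join(s.reference for s in chunk),
                score=mean(s.score for s in chunk),
            )
        )
    return paragraphs
```

Under these assumptions, a sentence-level metric could be applied to the concatenated `hypothesis` and `reference` strings directly and compared against the aggregated `score`, which is the kind of comparison the abstract describes.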