Automatic machine translation metrics typically rely on human reference translations to determine the quality of system translations. Common wisdom in the field dictates that these references should be of very high quality. However, there are no cost-benefit analyses to guide practitioners who plan to collect references for machine translation evaluation. We find that higher-quality references lead to better metric correlations with human judgments at the segment level. Having up to 7 references per segment and taking the average (or maximum) of the per-reference scores helps all metrics. Interestingly, references from vendors of different quality levels can be mixed together and still improve metric success. Higher-quality references, however, cost more to create, and we frame this as an optimization problem: given a specific budget, which references should be collected to maximize metric success? These findings can be used by evaluators of shared tasks when references need to be created under a certain budget.
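The multi-reference setup described above can be illustrated with a minimal sketch: score a system translation against each reference separately, then aggregate by average or maximum. The metric here (`token_f1`) is a toy stand-in chosen only to keep the example self-contained, and the function and parameter names are hypothetical; it is not one of the metrics studied in the paper.

```python
from statistics import mean
from typing import Callable, Sequence


def token_f1(hypothesis: str, reference: str) -> float:
    """Toy stand-in metric (an assumption): F1 over unique whitespace tokens."""
    hyp_tokens = set(hypothesis.split())
    ref_tokens = set(reference.split())
    overlap = len(hyp_tokens & ref_tokens)
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def multi_reference_score(
    hypothesis: str,
    references: Sequence[str],
    metric: Callable[[str, str], float] = token_f1,
    aggregate: str = "average",
) -> float:
    """Score one segment against several references and aggregate the scores."""
    if not references:
        raise ValueError("at least one reference is required")
    # One score per reference; any segment-level metric could be plugged in.
    scores = [metric(hypothesis, reference) for reference in references]
    if aggregate == "average":
        return mean(scores)
    if aggregate == "maximum":
        return max(scores)
    raise ValueError(f"unknown aggregation: {aggregate!r}")


if __name__ == "__main__":
    refs = ["the cat sat", "a dog ran"]
    print(multi_reference_score("the cat sat", refs, aggregate="average"))
    print(multi_reference_score("the cat sat", refs, aggregate="maximum"))
```

Averaging rewards a translation that is close to all references, while the maximum rewards closeness to any single one; which works better for a given metric is an empirical question.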