Automatic machine translation metrics typically rely on human reference translations to judge the quality of system translations. Common wisdom in the field holds that these references should be of very high quality. However, no cost-benefit analyses exist to guide practitioners who plan to collect references for machine translation evaluation. We find that higher-quality references lead to better segment-level correlations between metrics and human judgments. Using up to 7 references per segment and taking the average (or maximum) of the resulting scores helps all metrics. Interestingly, references from vendors of different quality levels can be mixed together and improve metric success. Higher-quality references, however, cost more to create, so we frame reference collection as an optimization problem: given a specific budget, which references should be collected to maximize metric success? These findings can be used by evaluators of shared tasks when references need to be created under a fixed budget.
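To make the multi-reference setup concrete, the following is a minimal sketch of scoring one system translation against several references and aggregating the per-reference scores by their average or maximum. It is an illustration under stated assumptions, not the procedure used in this work: `unigram_f1` is a toy stand-in for a real segment-level metric, and the function names and the `aggregate` parameter are hypothetical.

```python
from collections import Counter
from statistics import mean
from typing import Callable, Sequence


def unigram_f1(hypothesis: str, reference: str) -> float:
    """Toy segment-level metric (assumption): F1 over whitespace-token overlap."""
    hyp_tokens = Counter(hypothesis.lower().split())
    ref_tokens = Counter(reference.lower().split())
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((hyp_tokens & ref_tokens).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp_tokens.values())
    recall = overlap / sum(ref_tokens.values())
    return 2 * precision * recall / (precision + recall)


def multi_reference_score(
    hypothesis: str,
    references: Sequence[str],
    metric: Callable[[str, str], float] = unigram_f1,
    aggregate: str = "mean",
) -> float:
    """Score a hypothesis against each reference, then aggregate the scores."""
    if not references:
        raise ValueError("at least one reference is required")
    scores = [metric(hypothesis, ref) for ref in references]
    if aggregate == "mean":
        return mean(scores)
    if aggregate == "max":
        return max(scores)
    raise ValueError(f"unknown aggregate: {aggregate!r}")


if __name__ == "__main__":
    hypothesis = "the cat sat on the mat"
    references = [
        "the cat sat on the mat",
        "a cat was sitting on a mat",
    ]
    print(multi_reference_score(hypothesis, references, aggregate="mean"))
    print(multi_reference_score(hypothesis, references, aggregate="max"))
```

Any segment-level metric with the same `(hypothesis, reference) -> float` shape could be passed as `metric`. With the maximum, a hypothesis is rewarded for matching any single reference closely, whereas the average rewards agreement with the reference set as a whole.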