Large language models have demonstrated translation performance comparable, and in some cases superior, to that of neural machine translation (NMT) systems. However, existing comparative studies of the two rely mainly on automated metrics, raising questions about the suitability of these metrics and their alignment with human judgment. The present study investigates the convergences and divergences between automated metrics and human evaluation in assessing the quality of machine translation from ChatGPT and three NMT systems. Automatic assessment employs four automated metrics, while human evaluation incorporates the DQF-MQM error typology and six rubrics. Notably, automatic assessment and human evaluation converge in measuring formal fidelity (e.g., error rates) but diverge when evaluating semantic and pragmatic fidelity, with automated metrics failing to capture the improvement in ChatGPT's translations brought about by prompt engineering. These results underscore the indispensable role of human judgment in evaluating the performance of advanced translation tools at the current stage.