This paper investigates two complementary paradigms for predicting machine translation (MT) quality: source-side difficulty prediction and candidate-side quality estimation (QE). The rapid adoption of Large Language Models (LLMs) in MT workflows is reshaping the research landscape, yet its impact on established quality prediction paradigms remains underexplored. We study this issue through a series of "hindsight" experiments on a unique, multi-candidate dataset drawn from a genuine MT post-editing (MTPE) project. The dataset consists of over 6,000 English source segments, each paired with nine translation hypotheses from a diverse set of traditional neural MT systems and advanced LLMs, all evaluated against a single, final human post-edited reference. Using Kendall's rank correlation, we assess the predictive power of source-side difficulty metrics, candidate-side QE models, and position heuristics against two gold-standard scores: TER (as a proxy for post-editing effort) and COMET (as a proxy for human judgment). Our findings show that the architectural shift towards LLMs alters the reliability of established quality prediction methods while simultaneously mitigating previous challenges in document-level translation.
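As a rough illustration of the evaluation protocol described above, the sketch below computes segment-level Kendall's τ between a hypothetical per-segment predictor (a QE or source-difficulty score) and the two gold-standard signals. The score arrays are illustrative placeholders, not the paper's data, and the paper's actual metrics and models are not reproduced here.

```python
# Minimal sketch of the correlation protocol described in the abstract:
# correlate a per-segment predictor (QE or source-difficulty score) with
# gold-standard TER and COMET scores using Kendall's rank correlation.
# The score lists below are illustrative placeholders, not the paper's data.
from scipy.stats import kendalltau

# Hypothetical per-segment scores for one MT system's candidates.
qe_scores    = [0.81, 0.64, 0.92, 0.55, 0.73]   # candidate-side QE predictions
ter_scores   = [0.18, 0.42, 0.07, 0.51, 0.30]   # TER vs. post-edited reference (lower = less effort)
comet_scores = [0.86, 0.71, 0.93, 0.62, 0.78]   # COMET vs. the same reference (higher = better)

# Kendall's tau between the predictor and each gold-standard signal.
tau_ter, p_ter = kendalltau(qe_scores, ter_scores)
tau_comet, p_comet = kendalltau(qe_scores, comet_scores)

print(f"QE vs. TER:   tau={tau_ter:.3f} (p={p_ter:.3f})")    # expected negative: good QE ranks low-effort segments higher
print(f"QE vs. COMET: tau={tau_comet:.3f} (p={p_comet:.3f})") # expected positive
```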