Quality Estimation (QE) is the task of predicting the quality of Machine Translation (MT) system output without using any gold-standard reference translations. State-of-the-art QE models are supervised: they are trained on human-labeled quality judgments of the output of particular MT systems on particular datasets, which makes them domain-dependent and MT-system-dependent. Existing unsupervised QE approaches require either glass-box access to the MT system or parallel MT data from which synthetic errors are generated to train QE models. In this paper, we present Perturbation-based QE, a word-level QE approach that works simply by analyzing MT system output on perturbed input source sentences. Our approach is unsupervised, explainable, and can evaluate any type of black-box MT system, including the currently prominent large language models (LLMs), whose internal processes are opaque. For language directions with no labeled QE data, our approach performs similarly to or better than the zero-shot supervised approach on the WMT21 shared task. It is also better than supervised QE at detecting gender bias and word-sense-disambiguation errors in translation, indicating its robustness to out-of-domain usage. The performance gap is larger when detecting errors made by a nontraditional translation-prompting LLM, indicating that our approach generalizes better across MT systems. We give examples demonstrating our approach's explainability, in which it shows which input source words influence a given MT output word.
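To make the core idea concrete, the following is a minimal sketch of perturbation-based scoring against a black-box translator. It is not the paper's implementation. The names `influence_map`, `instability_scores`, `toy_translate`, and `TOY_LEXICON` are hypothetical, the replacement words are supplied by hand, and an output word counts as "changed" whenever it is absent from the perturbed translation, which is a bag-of-words simplification. The sketch records which source positions, when perturbed, change each output word, and treats an output word that changes under many different perturbations as less reliable.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple

# A black-box MT system: tokens in, tokens out. No internals are inspected.
Translate = Callable[[Sequence[str]], List[str]]


def influence_map(
    source: Sequence[str],
    translate: Translate,
    replacements: Dict[int, List[str]],
) -> Tuple[List[str], List[Set[int]]]:
    """Return the base translation and, for each output word, the set of
    source positions whose perturbation made that word disappear."""
    base = translate(source)
    influence: List[Set[int]] = [set() for _ in base]
    for pos, candidates in replacements.items():
        for cand in candidates:
            perturbed = list(source)
            perturbed[pos] = cand  # replace exactly one source word
            out = set(translate(perturbed))
            for i, word in enumerate(base):
                # Assumption: "changed" means absent from the new output.
                if word not in out:
                    influence[i].add(pos)
    return base, influence


def instability_scores(
    influence: List[Set[int]], n_perturbed_positions: int
) -> List[float]:
    """Fraction of perturbed source positions that affect each output word."""
    if n_perturbed_positions == 0:
        return [0.0] * len(influence)
    return [len(s) / n_perturbed_positions for s in influence]


# Hypothetical toy system used only to exercise the functions above.
TOY_LEXICON = {
    "the": "le", "cat": "chat", "dog": "chien",
    "sleeps": "dort", "runs": "court", "here": "ici", "there": "la-bas",
}


def toy_translate(tokens: Sequence[str]) -> List[str]:
    tokens = list(tokens)
    out = [TOY_LEXICON.get(t, t) for t in tokens]
    # Deliberate toy defect: "here" is rendered as "ici" only in one context.
    if "here" in tokens and not ("cat" in tokens and "sleeps" in tokens):
        out[tokens.index("here")] = "ca"
    return out


source = ["the", "cat", "sleeps", "here"]
replacements = {1: ["dog"], 2: ["runs"], 3: ["there"]}
base, influence = influence_map(source, toy_translate, replacements)
scores = instability_scores(influence, len(replacements))
for word, positions, score in zip(base, influence, scores):
    print(word, sorted(positions), round(score, 2))
```

In this toy run, the output word "ici" changes no matter which of the three source words is perturbed, so it receives the highest instability score, while "chat" and "dort" change only when their own source word is replaced. The per-word sets of influencing source positions are also what makes the result inspectable. A real setup would need a principled source of replacement words and a proper alignment between original and perturbed outputs, neither of which is specified here.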