The vast majority of evaluation metrics for machine translation are supervised, i.e., they (i) are trained on human scores, (ii) assume the existence of reference translations, or (iii) leverage parallel data. This hinders their applicability to cases where such supervision signals are not available. In this work, we develop fully unsupervised evaluation metrics. To do so, we leverage similarities and synergies between evaluation metric induction, parallel corpus mining, and MT systems. In particular, we use an unsupervised evaluation metric to mine pseudo-parallel data, which we use both to iteratively remap deficient underlying vector spaces and to induce an unsupervised MT system, which in turn provides pseudo-references as an additional component in the metric. Finally, we also induce unsupervised multilingual sentence embeddings from the pseudo-parallel data. We show that our fully unsupervised metrics are effective: they beat supervised competitors on 4 of our 5 evaluation datasets. We make our code publicly available.
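The mine-then-remap loop described in the abstract can be illustrated with a minimal sketch. The abstract does not state how pairs are mined or how the vector space is remapped, so everything here is an assumption made for illustration: sentence embeddings are plain NumPy arrays, mining uses mutual nearest neighbours under cosine similarity, and remapping uses an orthogonal Procrustes fit. The function names (`mine_pairs`, `procrustes`, `iterative_remap`) are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def mine_pairs(src, tgt):
    """Return (i, j) index pairs that are mutual nearest neighbours.

    Assumption: mutual nearest neighbours stand in for the unsupervised
    metric that scores candidate sentence pairs.
    """
    sim = normalize(src) @ normalize(tgt).T
    best_tgt = sim.argmax(axis=1)  # best target for every source sentence
    best_src = sim.argmax(axis=0)  # best source for every target sentence
    return [(i, j) for i, j in enumerate(best_tgt) if best_src[j] == i]


def procrustes(src, tgt):
    """Orthogonal map W minimising ||src @ W - tgt||_F (closed form via SVD)."""
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt


def iterative_remap(src, tgt, rounds=3):
    """Alternate between mining pseudo-parallel pairs and refitting the map."""
    w = np.eye(src.shape[1])
    for _ in range(rounds):
        pairs = mine_pairs(src @ w, tgt)
        if not pairs:
            break  # nothing mined, keep the current map
        i, j = map(list, zip(*pairs))
        # Refit on the original source vectors of the mined pairs.
        w = procrustes(src[i], tgt[j])
    return w, mine_pairs(src @ w, tgt)
```

Calling `iterative_remap(src, tgt)` on two embedding matrices with the same dimensionality returns the learned map and the pairs mined in the remapped space. In the pipeline the abstract outlines, such pairs would also serve as training data for an unsupervised MT system; that step is outside the scope of this sketch.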