With the wide application of machine translation, the testing of Machine Translation Systems (MTSs) has attracted much attention. Recent works apply Metamorphic Testing (MT) to address the oracle problem in MTS testing. Existing MT methods for MTSs generally follow a workflow of input transformation and output relation comparison: they generate a follow-up input sentence by mutating the source input, and then compare the source and follow-up output translations to detect translation errors. These methods use various input transformations to generate test case pairs and have successfully triggered numerous translation errors. However, they have limitations in performing fine-grained and rigorous output relation comparison, and thus may report many false alarms and miss many true errors. In this paper, we propose a word closure-based output comparison method to address the limitations of the existing MTS MT methods. We first propose the word closure as a new comparison unit, where each closure includes a group of correlated input and output words in the test case pair. Each word closure links an appropriate fragment in the source output translation with its counterpart in the follow-up output for comparison. Next, we compare semantics at the level of word closures to identify translation errors. In this way, we perform a fine-grained and rigorous semantic comparison of the outputs and thus achieve more effective violation identification. We evaluate our method with the test cases generated by five existing input transformations and the translation outputs from three popular MTSs. Results show that our method significantly outperforms existing works in violation identification, improving both precision and recall and achieving an average increase of 29.9% in F1 score. It also increases the F1 score of translation error localization by 35.9%.
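To illustrate the closure-level comparison idea described above, the following is a minimal sketch, not the paper's implementation: the function names, the toy word alignments, and the exact-match similarity stand in for the paper's actual alignment and semantic-similarity components.

```python
# Hypothetical sketch of word closure-based output comparison.
# A closure groups an input word with the output words aligned to it in the
# source translation and in the follow-up translation, so the two output
# fragments can be compared directly.

def build_closures(src_align, fu_align):
    """src_align / fu_align map each input word to the output words aligned
    with it in the source / follow-up translation (toy alignments here)."""
    return [
        (word, src_align[word], fu_align[word])
        for word in src_align
        if word in fu_align
    ]

def find_violations(closures, similar):
    """Flag any closure whose two output fragments are not similar; this
    localizes the suspected error to a small fragment rather than the
    whole sentence."""
    return [c for c in closures if not similar(c[1], c[2])]

# Toy similarity: exact match of the aligned fragments. A real system would
# use a semantic measure (e.g. embedding cosine similarity) instead.
similar = lambda a, b: set(a) == set(b)

# Illustrative English-to-French alignments for a source/follow-up pair.
src_align = {"cat": ["chat"], "sat": ["était", "assis"]}
fu_align = {"cat": ["chat"], "sat": ["posé"]}

violations = find_violations(build_closures(src_align, fu_align), similar)
print(violations)  # only the "sat" closure is flagged
```

Comparing per closure, rather than over whole sentences, is what enables both the finer-grained violation identification and the error localization reported in the evaluation.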