Word Closure-Based Metamorphic Testing for Machine Translation

With the wide application of machine translation, the testing of Machine Translation Systems (MTSs) has attracted much attention. Recent works apply Metamorphic Testing (MT) to address the oracle problem in MTS testing. Existing MT methods for MTSs generally follow a two-step workflow of input transformation and output relation comparison: they generate a follow-up input sentence by mutating the source input, then compare the source and follow-up output translations to detect translation errors. These methods use various input transformations to generate test case pairs and have successfully triggered numerous translation errors. However, their output relation comparison is neither fine-grained nor rigorous, so they may report false alarms and miss true errors. In this paper, we propose a word closure-based output comparison method to address these limitations. Specifically, we first build a new comparison unit called a word closure, where each closure comprises a group of correlated input and output words in the test case pair. Word closures link each fragment of the source output translation to its counterpart in the follow-up output, so that the two can be compared. Next, we compare semantics at the level of word closures to identify translation errors. In this way, we perform a fine-grained and rigorous semantic comparison of the outputs and thus identify violations more effectively. We evaluate our method with test cases generated by five existing input transformations and translation outputs from three popular MTSs. Results show that our method significantly outperforms existing works in violation identification, improving both precision and recall and achieving an average increase of 29.8% in F1 score. It also increases the F1 score of translation error localization by 35.9%.
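To make the idea of a word closure concrete, the following is a minimal sketch, not the method as implemented in the paper. It assumes that word-level links (source input to source output, follow-up input to follow-up output, and unchanged source input words to their follow-up input counterparts) are already available, and it treats a closure as a connected component of those links. The node layout `(side, index, word)`, the function names `build_closures` and `suspicious_closures`, the toy sentences, and the use of exact string equality in place of a semantic comparison are all illustrative assumptions.

```python
from collections import defaultdict


def build_closures(links):
    """Group linked word nodes into connected components (union-find)."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())


def suspicious_closures(closures):
    """Flag closures whose source and follow-up output fragments differ.

    Exact string equality is a stand-in (assumption) for the semantic
    comparison described in the abstract.
    """
    flagged = []
    for closure in closures:
        ordered = sorted(closure)
        src = [word for side, _, word in ordered if side == "src_out"]
        fol = [word for side, _, word in ordered if side == "fol_out"]
        if src and fol and src != fol:
            flagged.append((src, fol))
    return flagged


# Toy test case pair (hypothetical): "I like pears" -> "I like oranges".
# The follow-up translation wrongly renders "like" as "hasse" (hate).
LINKS = [
    (("src_in", 0, "I"), ("src_out", 0, "Ich")),
    (("src_in", 1, "like"), ("src_out", 1, "mag")),
    (("src_in", 2, "pears"), ("src_out", 2, "Birnen")),
    (("fol_in", 0, "I"), ("fol_out", 0, "Ich")),
    (("fol_in", 1, "like"), ("fol_out", 1, "hasse")),
    (("fol_in", 2, "oranges"), ("fol_out", 2, "Orangen")),
    # Unchanged input words link the source side to the follow-up side.
    (("src_in", 0, "I"), ("fol_in", 0, "I")),
    (("src_in", 1, "like"), ("fol_in", 1, "like")),
]

if __name__ == "__main__":
    for src, fol in suspicious_closures(build_closures(LINKS)):
        print("mismatch:", src, "vs", fol)
```

In this toy pair, the closure around the unchanged word "like" contains differing output fragments and is flagged, while the mutated words "pears" and "oranges" fall into closures with output on only one side and are skipped. Comparing per closure rather than per sentence is what localizes the mismatch to a specific fragment.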
