Behavioral testing offers a crucial means of diagnosing linguistic errors and assessing the capabilities of NLP models. However, applying behavioral testing to machine translation (MT) systems is challenging, as it generally requires human effort to craft references for evaluating translation quality on newly generated test cases. Existing work on behavioral testing of MT systems circumvents this by evaluating translation quality without references, but this restricts diagnosis to specific types of errors, such as incorrect translation of single numeric or currency words. To diagnose general errors, this paper proposes a new Bilingual Translation Pair Generation based Behavior Testing (BTPGBT) framework for conducting behavioral testing of MT systems. The core idea of BTPGBT is to employ a novel bilingual translation pair generation (BTPG) approach that automates the construction of high-quality test cases and their pseudoreferences. Experimental results on various MT systems demonstrate that BTPGBT can provide comprehensive and accurate behavioral testing results for general error diagnosis, which in turn yield several insightful findings. Our code and data are available at https://github.com/wujunjie1998/BTPGBT.