Behavioral testing in NLP allows fine-grained evaluation of systems by examining their linguistic capabilities through the analysis of input-output behavior. Unfortunately, existing work on behavioral testing in Machine Translation (MT) is restricted to largely handcrafted tests that cover a limited range of capabilities and languages. To address this limitation, we propose to use Large Language Models (LLMs) to generate a diverse set of source sentences tailored to test the behavior of MT models in a range of situations. We can then verify whether the MT model exhibits the expected behavior by matching its output against candidate sets that are also generated using LLMs. Our approach aims to make behavioral testing of MT systems practical while requiring only minimal human effort. In our experiments, we apply the proposed evaluation framework to assess multiple available MT systems. We find that, while pass rates generally follow the trends observable from traditional accuracy-based metrics, our method uncovers several important differences and potential bugs that go unnoticed when relying only on accuracy.
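The verification step described in the abstract (checking MT output against a candidate set and reporting a pass rate) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the names `BehavioralTest`, `passes`, and `pass_rate` are hypothetical, the matching rule is a simple case-insensitive substring check, and the toy translation table stands in for a real MT system.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class BehavioralTest:
    """One test case (hypothetical structure, not the paper's format)."""
    source: str                 # source sentence, e.g. generated by an LLM
    candidates: frozenset[str]  # target-side strings that count as expected behavior


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so matching ignores trivial differences."""
    return " ".join(text.lower().split())


def passes(test: BehavioralTest, translation: str) -> bool:
    """Assumed matching rule: the test passes if any non-empty candidate
    occurs as a substring of the normalized translation."""
    hypothesis = normalize(translation)
    return any(
        normalize(c) in hypothesis
        for c in test.candidates
        if normalize(c)  # skip empty candidates, which would match everything
    )


def pass_rate(tests: Iterable[BehavioralTest],
              translate: Callable[[str], str]) -> float:
    """Fraction of tests whose translation matches its candidate set."""
    tests = list(tests)
    if not tests:
        raise ValueError("pass_rate needs at least one test")
    passed = sum(passes(t, translate(t.source)) for t in tests)
    return passed / len(tests)


# Toy English-to-German example; the lookup table stands in for an MT system.
toy_outputs = {
    "She paid 25 dollars.": "Sie zahlte 25 Dollar.",
    "He arrived on 3 May.": "Er kam im Mai an.",  # the day of the month is dropped
}
toy_tests = [
    BehavioralTest("She paid 25 dollars.", frozenset({"25"})),
    BehavioralTest("He arrived on 3 May.", frozenset({"3. Mai", "3 Mai"})),
]
toy_rate = pass_rate(toy_tests, lambda src: toy_outputs[src])
```

In this toy run the first test passes and the second fails because the date is lost, so `toy_rate` is 0.5. A real setup would likely need a more careful matching rule (tokenization, morphology, multiple valid surface forms), which is why the candidate set is kept as a set and not a single reference string.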