Behavioral testing in NLP enables fine-grained evaluation of systems by examining their linguistic capabilities through analysis of their input-output behavior. However, existing work on behavioral testing in Machine Translation (MT) is restricted to largely handcrafted tests that cover a limited range of capabilities and languages. To address this limitation, we propose using Large Language Models (LLMs) to generate a diverse set of source sentences tailored to test the behavior of MT models in a range of situations. We then verify whether the MT model exhibits the expected behavior by checking its output against matching candidate sets that are also generated using LLMs. Our approach aims to make behavioral testing of MT systems practical while requiring only minimal human effort. In our experiments, we apply the proposed evaluation framework to assess multiple available MT systems. Pass rates generally follow the trends observable from traditional accuracy-based metrics, but our method also uncovers several important differences and potential bugs that go unnoticed when relying on accuracy alone.
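To make the notion of a pass rate concrete, the following is a minimal sketch of how a behavioral test with a candidate set might be scored. It is not the authors' implementation: the `BehavioralTest` structure, the case-insensitive substring matching rule, the toy sentences, and the stand-in `toy_translate` function are all assumptions introduced for illustration. In the proposed framework, both the source sentences and the candidate sets would be generated by an LLM rather than written by hand.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class BehavioralTest:
    """One test case (hypothetical format, not taken from the paper)."""
    source: str                  # source sentence to be translated
    candidates: tuple[str, ...]  # acceptable target-side fragments


def passes(translation: str, candidates: Iterable[str]) -> bool:
    """Assumed matching rule: the test passes if any candidate occurs
    in the translation, ignoring case."""
    normalized = translation.casefold()
    return any(c.casefold() in normalized for c in candidates)


def pass_rate(translate: Callable[[str], str],
              tests: list[BehavioralTest]) -> float:
    """Fraction of tests for which the system output matches a candidate."""
    if not tests:
        raise ValueError("at least one test is required")
    passed = sum(passes(translate(t.source), t.candidates) for t in tests)
    return passed / len(tests)


# Toy stand-in for an MT system (English to German). The second output
# deliberately corrupts the number to mimic the kind of bug a behavioral
# test is meant to catch.
_TOY_OUTPUTS = {
    "I have 3 apples.": "Ich habe drei Äpfel.",
    "She paid 250 euros.": "Sie zahlte 25 Euro.",
}


def toy_translate(source: str) -> str:
    return _TOY_OUTPUTS.get(source, "")


TESTS = [
    BehavioralTest("I have 3 apples.", ("3 Äpfel", "drei Äpfel")),
    BehavioralTest("She paid 250 euros.", ("250 Euro",)),
]

if __name__ == "__main__":
    print(f"pass rate: {pass_rate(toy_translate, TESTS):.2f}")
```

In this toy setup, the first translation matches a candidate while the second does not, because the number was altered. A surface-overlap metric could still score the second output highly, which illustrates the kind of error a targeted behavioral test can expose.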