Behavioral testing in NLP enables fine-grained evaluation of systems by examining their linguistic capabilities through analysis of input-output behavior. Unfortunately, existing work on behavioral testing in Machine Translation (MT) is restricted to largely handcrafted tests covering a limited range of capabilities and languages. To address this limitation, we propose using Large Language Models (LLMs) to generate a diverse set of source sentences tailored to test the behavior of MT models in a range of situations. We then verify whether the MT model exhibits the expected behavior by matching its output against candidate sets that are also generated using LLMs. Our approach aims to make behavioral testing of MT systems practical while requiring only minimal human effort. In our experiments, we apply the proposed evaluation framework to multiple available MT systems and find that, while pass rates generally follow the trends observable from traditional accuracy-based metrics, our method uncovers several important differences and potential bugs that go unnoticed when relying only on accuracy.
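To make the notion of a pass rate concrete, the following is a minimal sketch of candidate-set matching under stated assumptions. The names `BehavioralTest`, `passes`, and `pass_rate` are hypothetical, the test cases and the toy translator are invented for illustration, and the case-insensitive substring rule is only one plausible matching criterion, not necessarily the one used in our framework.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class BehavioralTest:
    """One test case (hypothetical structure, for illustration only)."""
    source: str            # source sentence, LLM-generated in the proposed setup
    candidates: List[str]  # target-side strings that count as expected behavior


def passes(test: BehavioralTest, translation: str) -> bool:
    # Assumed matching rule: the test passes if any candidate occurs in the
    # translation as a case-insensitive substring.
    text = translation.casefold()
    return any(c.casefold() in text for c in test.candidates)


def pass_rate(tests: Iterable[BehavioralTest],
              translate: Callable[[str], str]) -> float:
    # Fraction of test cases whose translation matches its candidate set.
    tests = list(tests)
    if not tests:
        raise ValueError("at least one test case is required")
    passed = sum(passes(t, translate(t.source)) for t in tests)
    return passed / len(tests)


# Toy English-to-German example; the dictionary stands in for a real MT system.
toy_tests = [
    BehavioralTest("I have 3 cats.", ["3 Katzen", "drei Katzen"]),
    BehavioralTest("She paid $20.", ["20 Dollar", "20 $"]),
]
toy_outputs = {
    "I have 3 cats.": "Ich habe drei Katzen.",
    "She paid $20.": "Sie zahlte 20 Euro.",  # currency changed: test fails
}
toy_rate = pass_rate(toy_tests, lambda s: toy_outputs[s])
print(toy_rate)
```

In this toy setting the second translation preserves the number but not the currency, so it fails its test even though a surface-overlap metric might still score it highly. This is the kind of difference the pass rate is meant to expose.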