Evaluating machine translation (MT) quality in extremely low-resource language (ELRL) scenarios poses unique challenges: widely used metrics such as BLEU, though effective in high-resource settings, often misrepresent quality in data-scarce contexts. This work presents a comparative analysis of BLEU, an n-gram-based metric, and ChrF++, a character-based metric, for MT evaluation in ELRL settings. We examine how each metric responds to translation artifacts, including hallucinations, repetition, source-text copying, and diacritic (\textit{matra}) variations, across three ELRLs (Magahi, Bhojpuri, and Chhattisgarhi), focusing on outputs from large language models (LLMs) and neural MT (NMT) systems. While recent work often relies solely on ChrF++, our findings show that BLEU, despite its lower absolute scores, provides complementary lexical-precision insights that improve interpretability.