This paper presents the winning submission of the RaaVa team to the AmericasNLP 2025 Shared Task 3 on Automatic Evaluation Metrics for Machine Translation (MT) into Indigenous Languages of America, where our system ranked first overall based on average Pearson correlation with the human annotations. We introduce the Feature-Union Scorer for Evaluation (FUSE), which integrates Ridge regression and Gradient Boosting to model translation quality. In addition to FUSE, we explore five alternative approaches that leverage different combinations of linguistic similarity features and learning paradigms. FUSE highlights the effectiveness of combining lexical, phonetic, semantic, and fuzzy token similarity with learning-based modeling to improve MT evaluation for morphologically rich and low-resource languages. MT into Indigenous languages poses unique challenges due to polysynthesis, complex morphology, and non-standardized orthography, and conventional automatic metrics such as BLEU, TER, and ChrF often fail to capture deeper aspects of quality such as semantic adequacy and fluency. Our framework additionally incorporates multilingual sentence embeddings and phonological encodings to better align with human evaluation. We train supervised models on human-annotated development sets and evaluate on held-out test data. Results show that FUSE consistently achieves higher Pearson and Spearman correlations with human judgments, offering a robust and linguistically informed solution for MT evaluation in low-resource settings.
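As a rough illustration only (not the authors' released implementation), two of the surface-level signals a FUSE-style metric combines before the learned regression stage — character n-gram overlap in the spirit of ChrF, and fuzzy token similarity — can be sketched with the Python standard library; the function names and the trigram order are assumptions for the example:

```python
from difflib import SequenceMatcher


def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Return the list (multiset) of character n-grams in `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_overlap_f1(hyp: str, ref: str, n: int = 3) -> float:
    """ChrF-like F1 over character n-grams, with clipped matching."""
    h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
    if not h or not r:
        return 0.0
    # Count reference n-grams so each can be matched at most once.
    r_counts: dict[str, int] = {}
    for g in r:
        r_counts[g] = r_counts.get(g, 0) + 1
    common = 0
    for g in h:
        if r_counts.get(g, 0) > 0:
            common += 1
            r_counts[g] -= 1
    prec, rec = common / len(h), common / len(r)
    return 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0


def fuzzy_token_similarity(hyp: str, ref: str) -> float:
    """Mean best-match edit-similarity between hypothesis and reference tokens.

    Robust to the orthographic variation common in low-resource settings,
    since near-identical word forms still score close to 1.0.
    """
    hyp_tokens, ref_tokens = hyp.split(), ref.split()
    if not hyp_tokens or not ref_tokens:
        return 0.0
    best_ratios = [
        max(SequenceMatcher(None, tok, rt).ratio() for rt in ref_tokens)
        for tok in hyp_tokens
    ]
    return sum(best_ratios) / len(best_ratios)
```

In the full system such features would be stacked into a vector per sentence pair and fed to supervised regressors (here, Ridge regression and Gradient Boosting) trained against the human annotations; the stdlib sketch above covers only the feature side.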