Machine Translation has made impressive progress in recent years, offering near human-level performance for many languages, but research has primarily focused on high-resource languages with a broad online presence. With the rise of Large Language Models, more and more low-resource languages achieve better results by leveraging data from other languages. However, studies have shown that not all low-resource languages benefit from multilingual systems, especially those with insufficient training and evaluation data. In this paper, we revisit state-of-the-art Neural Machine Translation techniques to develop automatic translation systems between German and Bavarian. We investigate conditions typical of low-resource languages, such as data scarcity and parameter sensitivity, and focus both on refined solutions that combat low-resource difficulties and on creative approaches such as harnessing language similarity. Our experiments apply Back-translation and Transfer Learning to automatically generate more training data and achieve higher translation performance. We demonstrate the noisiness of the data and describe our extensive text preprocessing. Evaluation combines three metrics: BLEU, chrF and TER. Statistical significance testing with Bonferroni correction reveals surprisingly strong baseline systems and shows that Back-translation yields significant improvements. Finally, we present a qualitative analysis of translation errors and system limitations.
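To illustrate the Bonferroni correction used in the significance testing above, the following minimal sketch compares each p-value against an adjusted per-test threshold of α/m. The p-values and the choice of three comparisons (e.g. baseline vs. Back-translation on BLEU, chrF and TER) are hypothetical, for illustration only.

```python
# Sketch of Bonferroni-corrected significance testing across m comparisons.
# Each raw p-value is compared against alpha / m instead of alpha, which
# controls the family-wise error rate across all m tests.

def bonferroni_significant(p_values, alpha=0.05):
    """Return, per test, whether it remains significant after correction."""
    m = len(p_values)
    threshold = alpha / m
    return [p < threshold for p in p_values]

# Hypothetical p-values from three paired significance tests:
p_values = [0.004, 0.030, 0.012]
print(bonferroni_significant(p_values))  # threshold = 0.05 / 3 ≈ 0.0167
```

With three tests, the per-test threshold drops from 0.05 to roughly 0.0167, so a p-value of 0.030 that would pass an uncorrected test no longer counts as significant.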