Machine Translation has made impressive progress in recent years, offering near human-level performance for many languages, but research has primarily focused on high-resource languages with a broad online presence. With the rise of Large Language Models, more and more low-resource languages achieve better results by leveraging data from other languages. However, studies have shown that not all low-resource languages benefit from multilingual systems, especially those with insufficient training and evaluation data. In this paper, we revisit state-of-the-art Neural Machine Translation techniques to develop automatic translation systems between German and Bavarian. We investigate conditions typical of low-resource languages, such as data scarcity and parameter sensitivity, and focus both on refined solutions that combat low-resource difficulties and on creative approaches such as harnessing language similarity. Our experiments apply Back-translation and Transfer Learning to automatically generate more training data and achieve higher translation performance. We demonstrate the noisiness of the data and describe our extensive text preprocessing. Evaluation combines three metrics: BLEU, chrF and TER. Statistical significance testing with Bonferroni correction reveals surprisingly strong baseline systems and shows that Back-translation yields significant improvements. Finally, we present a qualitative analysis of translation errors and system limitations.
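To illustrate the Bonferroni correction used in the significance testing above, the following minimal sketch compares each p-value against an adjusted per-test threshold of α/m. The p-values and the choice of three comparisons (e.g. baseline vs. Back-translation on BLEU, chrF and TER) are hypothetical, for illustration only.

```python
# Sketch of Bonferroni-corrected significance testing across m comparisons.
# Each raw p-value is compared against alpha / m instead of alpha, which
# controls the family-wise error rate across all m tests.

def bonferroni_significant(p_values, alpha=0.05):
    """Return, per test, whether it remains significant after correction."""
    m = len(p_values)
    threshold = alpha / m
    return [p < threshold for p in p_values]

# Hypothetical p-values from three paired significance tests:
p_values = [0.004, 0.030, 0.012]
print(bonferroni_significant(p_values))  # threshold = 0.05 / 3 ≈ 0.0167
```

With three tests, the per-test threshold drops from 0.05 to roughly 0.0167, so a p-value of 0.030 that would pass an uncorrected test no longer counts as significant.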