Neural machine translation (NMT) for low-resource local languages in Indonesia faces significant challenges, including the lack of a representative benchmark and limited data availability. This work addresses these challenges with a comprehensive analysis of training NMT systems for four low-resource local languages in Indonesia: Javanese, Sundanese, Minangkabau, and Balinese. Our study covers various training approaches, paradigms, and data sizes, along with a preliminary study of using large language models to generate synthetic parallel data for low-resource languages. We reveal specific trends and insights into practical strategies for low-resource language translation. Our research demonstrates that, despite limited computational resources and textual data, several of our NMT systems achieve competitive performance, rivaling the translation quality of zero-shot gpt-3.5-turbo. These findings advance NMT for low-resource languages and offer practical guidance for researchers working in similar contexts.