Neural Machine Translation (NMT) for low-resource languages remains a challenging task for NLP researchers. In this work, we apply a standard data augmentation methodology, back-translation, to a new translation direction: Cantonese-to-English. We present models, including OpusMT, NLLB, and mBART, that we fine-tuned using the limited amount of real data and the synthetic data we generated via back-translation. We carried out automatic evaluation using a range of metrics, both lexical-based and embedding-based. Furthermore, we create a user-friendly interface for the models included in this \textsc{CantonMT} research project and make it available to facilitate Cantonese-to-English MT research. Researchers can add more models to this platform via our open-source \textsc{CantonMT} toolkit \url{https://github.com/kenrickkung/CantoneseTranslation}.