Neural Machine Translation (NMT) for low-resource languages remains a challenging task for NLP researchers. In this work, we apply a standard data augmentation methodology, back-translation, to a new translation direction: Cantonese-to-English. We present models, including OpusMT, NLLB, and mBART, that we fine-tuned on the limited amount of real data available and on synthetic data that we generated using back-translation. We carried out automatic evaluation using a range of metrics, both lexical-based and embedding-based. Furthermore, we created a user-friendly interface for the models included in this \textsc{CantonMT} research project and made it available to facilitate Cantonese-to-English MT research. Researchers can add more models to this platform via our open-source \textsc{CantonMT} toolkit \url{https://github.com/kenrickkung/CantoneseTranslation}.