CANTONMT: Investigating Back-Translation and Model-Switch Mechanisms for Cantonese-English Neural Machine Translation

This paper investigates the development and evaluation of Cantonese-to-English machine translation models and proposes a novel approach to low-resource language translation. The main objectives of the study are to develop a model that translates Cantonese to English effectively and to evaluate it against state-of-the-art commercial models. To achieve this, a new parallel corpus was created by combining, preprocessing, and cleaning several corpora available online. In addition, a monolingual Cantonese dataset was built through web scraping to support synthetic parallel corpus generation. Following data collection, several approaches were applied, including model fine-tuning, back-translation, and model switch. Translation quality was evaluated with multiple metrics, including lexicon-based metrics (SacreBLEU and hLEPOR) and embedding-space metrics (COMET and BERTScore). Based on the automatic metrics, the best model was selected and compared against the two best commercial translators using the human evaluation framework HOPES. The best model proposed in this investigation (NLLB-mBART), which uses the model-switch mechanism, achieved automatic evaluation scores comparable to or even better than those of state-of-the-art commercial models (Bing and Baidu Translators), with a SacreBLEU score of 16.8 on our test set. Furthermore, an open-source web application has been developed that lets users translate between Cantonese and English and compare the different trained models from this investigation. CANTONMT is available at https://github.com/kenrickkung/CantoneseTranslation
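
The synthetic parallel corpus step mentioned above can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the authors' pipeline: the function names `build_synthetic_pairs` and `merge_corpora`, the `min_chars` filter, and the `translate` callable are all hypothetical. In practice `translate` would wrap a fine-tuned translation model; here it is a toy dictionary lookup so the example runs on its own.

```python
from typing import Callable, Iterable, List, Tuple

# A sentence pair: (Cantonese source, English target).
Pair = Tuple[str, str]


def build_synthetic_pairs(
    monolingual: Iterable[str],
    translate: Callable[[str], str],  # hypothetical: stands in for a trained model
    min_chars: int = 2,               # hypothetical cleaning threshold
) -> List[Pair]:
    """Pair each monolingual Cantonese sentence with a machine translation."""
    pairs: List[Pair] = []
    seen = set()
    for sentence in monolingual:
        text = sentence.strip()
        # Basic cleaning: drop very short lines and exact duplicates.
        if len(text) < min_chars or text in seen:
            continue
        seen.add(text)
        english = translate(text).strip()
        # Skip sentences the translator produced no output for.
        if english:
            pairs.append((text, english))
    return pairs


def merge_corpora(real: List[Pair], synthetic: List[Pair]) -> List[Pair]:
    """Append synthetic pairs whose source side is not already in the real corpus."""
    real_sources = {src for src, _ in real}
    return list(real) + [p for p in synthetic if p[0] not in real_sources]


if __name__ == "__main__":
    # Toy stand-in for a translation model (assumption, for illustration only).
    toy_lexicon = {"你好": "hello", "多謝": "thank you"}
    synthetic = build_synthetic_pairs(
        ["你好", "多謝", "唔識"], lambda s: toy_lexicon.get(s, "")
    )
    print(merge_corpora([("你好", "hi")], synthetic))
```

Keeping the translator as a plain callable means the same loop works regardless of which model generates the synthetic side, which is convenient when comparing different model combinations.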

