The importance of high-quality parallel data in machine translation has long been established, but such data has always been very difficult to obtain in sufficient quantity for the majority of the world's languages, mainly because of the associated cost and the limited accessibility of these languages. Although parallel datasets can be obtained from online articles using automatic approaches, quality audits of such datasets have found many issues, such as misalignment and wrong language codes. In this work, we present a simple but high-quality parallel sentence aligner that carefully leverages the closed-access Cohere multilingual embedding, a solution that ranked second in the recently concluded #CoHereAIHack 2023 Challenge (see https://ai6lagos.devpost.com). The proposed approach achieved F1 scores of $94.96$ and $54.83$ on FLORES and MAFAND-MT, compared to $3.64$ and $0.64$ for LASER, respectively. Our method also achieved an improvement of more than 5 BLEU points over LASER when the resulting datasets were used together with the MAFAND-MT dataset to train translation models. Our code and data are available for research purposes at https://github.com/abumafrim/Cohere-Align.
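To make the idea of embedding-based sentence alignment and its F1 evaluation concrete, the following is a minimal sketch and not the implementation in the linked repository. It assumes sentence embeddings have already been computed by some multilingual model (the toy vectors below are placeholders, not Cohere outputs), and it uses mutual nearest neighbours under cosine similarity with a hypothetical `threshold` parameter, which is only one of several possible alignment rules. The function names `cosine_similarity_matrix`, `align`, and `f1_score` are illustrative.

```python
import numpy as np


def cosine_similarity_matrix(src, tgt):
    """Return the matrix of cosine similarities between rows of src and tgt."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    return src @ tgt.T


def align(src_emb, tgt_emb, threshold=0.5):
    """Pair sentence i with sentence j when each is the other's nearest
    neighbour and their similarity reaches the (assumed) threshold."""
    sim = cosine_similarity_matrix(src_emb, tgt_emb)
    best_tgt = sim.argmax(axis=1)  # nearest target for each source sentence
    best_src = sim.argmax(axis=0)  # nearest source for each target sentence
    pairs = set()
    for i, j in enumerate(best_tgt):
        if best_src[j] == i and sim[i, j] >= threshold:
            pairs.add((i, int(j)))
    return pairs


def f1_score(predicted, gold):
    """F1 of predicted alignment pairs against gold pairs, on a 0-1 scale."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy placeholder embeddings: three source and three target sentences.
src_emb = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0]])
tgt_emb = np.array([[0.1, 0.9, 0.0],
                    [0.9, 0.1, 0.0],
                    [0.6, 0.6, 0.5]])
gold = {(0, 1), (1, 0), (2, 2)}

predicted = align(src_emb, tgt_emb)
print(sorted(predicted), round(f1_score(predicted, gold), 2))
```

In this toy case the third source sentence is left unaligned because its nearest target is closer to another source sentence, so recall drops while precision stays perfect. The scores reported in the abstract are on a 0-100 scale, whereas `f1_score` here returns a value between 0 and 1.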