In this work, we investigate the impact of applying textual data augmentation techniques to low-resource machine translation. Training systems for languages with limited resources has attracted recent interest, and one popular approach is data augmentation, which aims to increase the quantity of data available for training. In machine translation, the majority of the world's language pairs are considered low-resource because little parallel data is available for them, and the quality of neural machine translation (NMT) systems depends heavily on the availability of sizable parallel corpora. We study and apply three simple data augmentation techniques commonly used in text classification (synonym replacement, random insertion, and contextual data augmentation) and compare their performance with a baseline NMT system on English-Swahili (En-Sw) datasets. We report results in BLEU, ChrF, and METEOR scores. Overall, the contextual data augmentation technique shows some improvement in both the $EN \rightarrow SW$ and $SW \rightarrow EN$ directions. These methods show potential for use in neural machine translation, though more extensive experiments with diverse datasets are needed to confirm it.
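To make two of the named techniques concrete, the following is a minimal sketch of synonym replacement and random insertion on a tokenized English sentence. It is an illustration under stated assumptions, not the authors' implementation: the `SYNONYMS` table is a hypothetical toy stand-in for a real thesaurus, the function names and parameters are invented for this example, and the abstract does not say how augmented source sentences were paired with target sentences. Contextual data augmentation is omitted because it requires a pretrained language model.

```python
import random

# Hypothetical toy synonym table; a real setup would draw on a thesaurus.
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "happy": ["glad", "joyful"],
    "big": ["large", "huge"],
}


def synonym_replacement(tokens, n, rng):
    """Replace up to n tokens that have an entry in SYNONYMS with a random synonym."""
    out = list(tokens)  # copy so the caller's list is not modified
    candidates = [i for i, tok in enumerate(out) if tok in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out


def random_insertion(tokens, n, rng):
    """Up to n times, insert a synonym of a randomly chosen token at a random position."""
    out = list(tokens)
    for _ in range(n):
        candidates = [tok for tok in out if tok in SYNONYMS]
        if not candidates:
            break  # nothing in the sentence has a known synonym
        synonym = rng.choice(SYNONYMS[rng.choice(candidates)])
        # randint is inclusive on both ends, so the synonym may also be appended
        out.insert(rng.randint(0, len(out)), synonym)
    return out


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    sentence = "the quick dog is happy".split()
    print(" ".join(synonym_replacement(sentence, 1, rng)))
    print(" ".join(random_insertion(sentence, 1, rng)))
```

Each call returns a new token list, so one original sentence can yield several augmented variants by calling the functions repeatedly with the same random generator.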