The advent of deep learning has led to significant gains in machine translation. However, most approaches require large parallel datasets, which are scarce, expensive to construct, and unavailable for some languages. This paper presents a simple yet effective method for low-resource languages that augments high-quality sentence pairs and trains neural machine translation (NMT) models in a semi-supervised manner. Specifically, our approach combines a cross-entropy loss for supervised learning with a KL-divergence loss for unsupervised learning on pseudo and augmented target sentences derived from the model. We also introduce a SentenceBERT-based filter that improves the quality of the augmented data by retaining only semantically similar sentence pairs. Experimental results show that our approach significantly improves over NMT baselines, especially on low-resource datasets, with gains of 0.46 to 2.03 BLEU points. We also demonstrate that unsupervised training on augmented data is more efficient than reusing the ground-truth target sentences for supervised learning.
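The two ingredients named in the abstract, a combined cross-entropy plus KL-divergence objective and a similarity-based filter, can be illustrated with a minimal sketch. The abstract does not specify the direction of the KL term, the loss weighting, or the similarity threshold, so `kl_weight`, `threshold`, and every function name below are hypothetical choices made for illustration. The embeddings are plain vectors standing in for SentenceBERT outputs; no real model is called.

```python
import math


def softmax(logits):
    """Turn a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, gold_index):
    """Negative log-likelihood of the gold token."""
    return -math.log(probs[gold_index])


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def semi_supervised_loss(labeled_logits, gold_indices,
                         pseudo_logits, augmented_logits, kl_weight=1.0):
    """Supervised cross-entropy plus a weighted KL consistency term.

    Assumption: the KL term compares the model's token distributions for the
    pseudo target with those for the augmented target, position by position.
    """
    ce = sum(cross_entropy(softmax(l), g)
             for l, g in zip(labeled_logits, gold_indices)) / len(gold_indices)
    kl = sum(kl_divergence(softmax(p), softmax(a))
             for p, a in zip(pseudo_logits, augmented_logits)) / len(pseudo_logits)
    return ce + kl_weight * kl


def cosine_similarity(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_augmented(pairs, threshold=0.8):
    """Keep (original, augmented) items whose embeddings are similar enough.

    Each item is (original_text, augmented_text, original_vec, augmented_vec);
    the vectors are assumed to come from a sentence encoder such as SentenceBERT.
    """
    return [(orig, aug) for orig, aug, u, v in pairs
            if cosine_similarity(u, v) >= threshold]


if __name__ == "__main__":
    loss = semi_supervised_loss(
        labeled_logits=[[2.0, 0.0]], gold_indices=[0],
        pseudo_logits=[[1.0, 0.5]], augmented_logits=[[0.8, 0.6]],
    )
    print(f"combined loss: {loss:.4f}")
    kept = filter_augmented([
        ("a cat sat", "a cat was sitting", [1.0, 0.0], [0.9, 0.1]),
        ("a cat sat", "stocks fell today", [1.0, 0.0], [0.0, 1.0]),
    ])
    print(kept)
```

In this sketch the KL term vanishes when the model assigns identical distributions to the pseudo and augmented targets, so the objective reduces to plain supervised cross-entropy in that case. In practice the logits would come from an NMT decoder and the loss would be computed with an automatic-differentiation framework.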