Quality Estimation (QE) is the task of evaluating the quality of a translation when no reference translation is available. The goal of QE aligns with that of corpus filtering, where a quality score is assigned to each sentence pair in a pseudo-parallel corpus. We propose a QE-based filtering approach to extract high-quality parallel data from a pseudo-parallel corpus. To the best of our knowledge, this is a novel adaptation of the QE framework to corpus filtering. Training on the filtered corpus improves Machine Translation (MT) performance by up to 1.8 BLEU points for the English-Marathi, Chinese-English, and Hindi-Bengali language pairs over a baseline model trained on the whole pseudo-parallel corpus. Our few-shot QE model, transfer-learned from the English-Marathi QE model and fine-tuned on only 500 Hindi-Bengali training instances, yields an improvement of up to 0.6 BLEU points for the Hindi-Bengali language pair over the baseline. This demonstrates the promise of transfer learning in this setting. QE systems typically require on the order of 7K-25K training instances. Our Hindi-Bengali QE model is trained on only 500 instances, which is 1/40th of the normal requirement, and achieves comparable performance. All scripts and datasets used in this study will be made publicly available.
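The filtering step described in the abstract can be illustrated with a minimal sketch: score every sentence pair, then keep the pairs whose score clears a cutoff. The abstract does not say how scores are turned into a keep/drop decision, so the fixed threshold here is an assumption. The names `filter_by_quality` and `toy_length_ratio_score` are hypothetical, and the toy scorer is only a stand-in so the example runs; it is not a QE model and not the authors' method.

```python
from typing import Callable, Iterable, List, Tuple

SentencePair = Tuple[str, str]  # (source sentence, target sentence)


def filter_by_quality(
    pairs: Iterable[SentencePair],
    score_fn: Callable[[str, str], float],
    threshold: float,
) -> List[SentencePair]:
    """Keep the sentence pairs whose score is at least `threshold`.

    `score_fn` stands in for a trained QE model; the fixed cutoff is an
    assumption made for this sketch.
    """
    return [(src, tgt) for src, tgt in pairs if score_fn(src, tgt) >= threshold]


def toy_length_ratio_score(source: str, target: str) -> float:
    """Hypothetical stand-in scorer (NOT a QE model).

    Returns the ratio of the shorter to the longer whitespace token count,
    or 0.0 if either side is empty.
    """
    src_len = len(source.split())
    tgt_len = len(target.split())
    if src_len == 0 or tgt_len == 0:
        return 0.0
    return min(src_len, tgt_len) / max(src_len, tgt_len)


if __name__ == "__main__":
    pseudo_parallel = [
        ("a b c", "x y z"),   # equal lengths, scores 1.0
        ("a b c d", "x"),     # very unbalanced, scores 0.25
        ("", "x"),            # empty source, scores 0.0
    ]
    kept = filter_by_quality(pseudo_parallel, toy_length_ratio_score, threshold=0.5)
    print(kept)
```

In practice `score_fn` would wrap whatever trained QE model is available, and the threshold would be tuned on held-out data rather than fixed by hand.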