Parallel Data Curation (PDC) techniques aim to filter out noisy parallel sentences from web-mined corpora. Prior research has demonstrated that ranking sentence pairs by similarity scores computed on sentence embeddings derived from Pre-trained Multilingual Language Models (multiPLMs), and training NMT systems on the top-ranked samples, yields better NMT performance than training on the full dataset. However, it has also been shown that the choice of multiPLM significantly affects ranking quality. This paper investigates the reasons behind this disparity across multiPLMs. Using the web-mined corpora CCMatrix and CCAligned for En$\rightarrow$Si, En$\rightarrow$Ta and Si$\rightarrow$Ta, we show that different multiPLMs (LASER3, XLM-R, and LaBSE) are biased towards certain types of sentences, which allows noisy sentences to creep into the top-ranked samples. We show that this noise can be removed to a certain extent by employing a series of heuristics. Doing so improves the performance of NMT systems trained on web-mined corpora and reduces the disparity across multiPLMs.
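To make the ranking step concrete, here is a minimal sketch (not the authors' implementation): it scores aligned sentence pairs by cosine similarity of LaBSE embeddings via the sentence-transformers library and keeps the top-ranked fraction. The checkpoint name, the 50% cutoff, and the toy sentences are illustrative assumptions.

```python
# Minimal sketch of similarity-based ranking for parallel data curation.
# Assumes the sentence-transformers package and the public LaBSE checkpoint
# on the Hugging Face hub; the top-fraction cutoff is an illustrative choice.
from sentence_transformers import SentenceTransformer
import numpy as np

def rank_pairs(src_sents, tgt_sents, top_fraction=0.5):
    """Score aligned (src, tgt) pairs by cosine similarity of their LaBSE
    embeddings and return indices of the top-ranked fraction plus all scores."""
    model = SentenceTransformer("sentence-transformers/LaBSE")
    src_emb = model.encode(src_sents, normalize_embeddings=True)
    tgt_emb = model.encode(tgt_sents, normalize_embeddings=True)
    # With L2-normalized embeddings, the row-wise dot product equals cosine similarity.
    scores = np.sum(src_emb * tgt_emb, axis=1)
    order = np.argsort(-scores)  # descending by similarity
    return order[: int(len(order) * top_fraction)], scores

# Toy usage: in the paper's setting the target side would be Sinhala or Tamil.
keep_idx, sims = rank_pairs(
    ["The cat sat on the mat.", "Click here to unsubscribe."],
    ["The cat was sitting on the mat.", "A completely unrelated sentence."],
)
print(keep_idx, sims)
```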
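The abstract does not enumerate the heuristics used. Purely for illustration, the sketch below shows two filters that are common in the PDC literature: a token-length-ratio check and a target-script check, the latter convenient for Sinhala and Tamil since each occupies a distinct Unicode block. These are assumptions, not the paper's actual rules.

```python
# Illustrative noise-filtering heuristics (assumptions, not the paper's rules).
# Script-range checks suit Sinhala (U+0D80-U+0DFF) and Tamil (U+0B80-U+0BFF).
import re

SINHALA = re.compile(r"[\u0D80-\u0DFF]")

def length_ratio_ok(src, tgt, max_ratio=2.0):
    """Reject pairs whose token-length ratio is implausibly large."""
    a, b = len(src.split()), len(tgt.split())
    return a > 0 and b > 0 and max(a, b) / min(a, b) <= max_ratio

def target_script_ok(tgt, min_fraction=0.5, script=SINHALA):
    """Require that most alphabetic characters of the target use the expected script."""
    letters = [c for c in tgt if c.isalpha()]
    if not letters:
        return False
    hits = sum(1 for c in letters if script.match(c))
    return hits / len(letters) >= min_fraction

def keep_pair(src, tgt):
    """Keep a pair only if it passes every heuristic."""
    return length_ratio_ok(src, tgt) and target_script_ok(tgt)

# Example: a pair with wildly mismatched lengths is rejected.
print(length_ratio_ok("one two three four five six", "one"))  # False
```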