An approach for mistranslation removal from popular dataset for Indic MT Task

Machine Translation (MT) is the conversion of content from one language to another by a computer system. Various techniques have been developed to produce translations that retain the contextual and lexical meaning of the source language. End-to-end Neural Machine Translation (NMT) is a popular technique that is now widely used in real-world MT systems. MT systems require massive parallel datasets (sentences in one language paired with their translations in another). These datasets are crucial for an MT system to learn the linguistic structures and patterns of both languages during the training phase. One such dataset is Samanantar, the largest publicly accessible parallel dataset for Indian languages (ILs). Since the corpus has been gathered from various sources, it contains many incorrect translations. Hence, MT systems built using this dataset cannot reach their full potential. In this paper, we propose an algorithm to remove mistranslations from the training corpus and evaluate its performance and efficiency. Two ILs, Hindi (HIN) and Odia (ODI), are chosen for the experiment. A baseline NMT system is built for these two ILs, and the effect of different dataset sizes is also investigated. The quality of the translations is evaluated using the standard metrics BLEU, METEOR, and RIBES. The results show that removing incorrect translations from the dataset improves translation quality. We also observe that, although the ILs-English and English-ILs systems are trained on the same corpus, ILs-English performs better across all the evaluation metrics.

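The abstract does not describe the proposed filtering algorithm, so the following is not that algorithm. It is only a minimal sketch of what removing mistranslations from a parallel corpus can look like, assuming two common heuristics chosen here for illustration: a script check (the English side should be mostly Latin letters, the IL side mostly Devanagari for Hindi or Oriya-script letters for Odia) and a token-length ratio check. The function names and the thresholds `min_script` and `max_len_ratio` are hypothetical.

```python
import unicodedata


def script_ratio(text, script_prefix):
    """Fraction of alphabetic characters whose Unicode name starts with script_prefix."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(
        1 for ch in letters
        if unicodedata.name(ch, "").startswith(script_prefix)
    )
    return hits / len(letters)


def looks_like_valid_pair(src_en, tgt_il, tgt_script,
                          min_script=0.8, max_len_ratio=3.0):
    """Heuristic check for one (English, IL) sentence pair.

    tgt_script is a Unicode name prefix, e.g. "DEVANAGARI" for Hindi
    or "ORIYA" for Odia. Both thresholds are illustrative assumptions.
    """
    src_tokens = src_en.split()
    tgt_tokens = tgt_il.split()
    if not src_tokens or not tgt_tokens:
        return False
    # Reject pairs whose lengths differ too much to be plausible translations.
    longer = max(len(src_tokens), len(tgt_tokens))
    shorter = min(len(src_tokens), len(tgt_tokens))
    if longer / shorter > max_len_ratio:
        return False
    # Reject pairs where either side is written in the wrong script.
    return (script_ratio(src_en, "LATIN") >= min_script
            and script_ratio(tgt_il, tgt_script) >= min_script)


def filter_corpus(pairs, tgt_script):
    """Keep only the pairs that pass the heuristic check."""
    return [p for p in pairs if looks_like_valid_pair(p[0], p[1], tgt_script)]


pairs = [
    ("How are you?", "आप कैसे हैं?"),   # plausible pair
    ("How are you?", "How are you?"),    # target left untranslated
    ("Hello", "यह एक बहुत लंबा वाक्य है जो मेल नहीं खाता"),  # length mismatch
]
kept = filter_corpus(pairs, "DEVANAGARI")
```

Heuristics of this kind only catch surface-level noise such as untranslated or badly misaligned pairs; they cannot detect a fluent sentence in the right script that carries the wrong meaning, which would require a semantic similarity signal.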