Machine Translation (MT) is one of the most prominent tasks in Natural Language Processing (NLP): the automatic conversion of text from one natural language to another while preserving its meaning and fluency. Although machine translation has been researched for several decades, the more recent integration of deep learning techniques into natural language processing has significantly improved translation quality. In this paper, we develop a Neural Machine Translation (NMT) system by training the Transformer model to translate text from Hindi, an Indian language, to English. Because Hindi is a low-resource language, it has been difficult for neural networks to learn it, which has slowed the development of neural machine translators for the language. To address this gap, we use back-translation to augment the training data. To build the vocabulary, we experiment with both word-level tokenization and subword-level tokenization using Byte Pair Encoding (BPE), training the Transformer in 10 different configurations in total. One of these configurations achieves a state-of-the-art BLEU score of 24.53 on the test set of the IIT Bombay English-Hindi Corpus.
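For readers unfamiliar with subword tokenization, the following is a minimal sketch of how Byte Pair Encoding learns its merge rules. It is not the implementation used in this work: the function names (`learn_bpe`, `merge_pair`, `apply_bpe`), the `</w>` end-of-word marker, and the toy word frequencies are all assumptions made for illustration. The idea is to start from individual characters and repeatedly merge the most frequent adjacent symbol pair, so that frequent words end up as single tokens while rare words are split into smaller, reusable units.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` BPE merge rules from a {word: frequency} mapping."""
    # Each word starts as a sequence of characters plus an end-of-word marker
    # (the "</w>" marker is an illustrative convention, not taken from the paper).
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pair_counts[(left, right)] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def apply_bpe(word, merges):
    """Segment a single word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    # Toy frequencies, invented for this example only.
    toy_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
    rules = learn_bpe(toy_freqs, num_merges=10)
    print(rules)
    print(apply_bpe("lowest", rules))
```

The number of merges controls the vocabulary size: with zero merges the model sees characters only, and with many merges the segmentation approaches word-level tokenization. Note that this sketch splits words by Unicode code point, so for Devanagari text a dependent vowel sign would start out as its own symbol; production BPE tools handle such details differently.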