Hindi to English: Transformer-Based Neural Machine Translation

Machine Translation (MT) is one of the most prominent tasks in Natural Language Processing (NLP): the automatic conversion of text from one natural language to another while preserving its meaning and fluency. Although machine translation has been researched for several decades, the more recent integration of deep learning techniques into NLP has led to significant improvements in translation quality. In this paper, we develop a Neural Machine Translation (NMT) system by training the Transformer model to translate text from Hindi, an Indian language, to English. Because Hindi is a low-resource language, neural networks have had difficulty learning it, which has slowed the development of neural machine translators. To address this gap, we implemented back-translation to augment the training data and, for creating the vocabulary, experimented with both word-level and subword-level tokenization using Byte Pair Encoding (BPE), training the Transformer in 10 different configurations in total. One of these configurations achieved a state-of-the-art BLEU score of 24.53 on the test set of the IIT Bombay English-Hindi Corpus.
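To make the subword tokenization step more concrete, the following is a minimal sketch of how BPE merge rules can be learned from word frequencies. It is not the authors' code: the function names (`get_pair_counts`, `merge_pair`, `learn_bpe`), the `</w>` end-of-word marker, and the toy English word counts are all assumptions chosen for illustration, and a real system would more likely rely on an existing tool such as subword-nmt or SentencePiece.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from a {word: count} mapping."""
    # Start from characters, with a hypothetical end-of-word marker.
    vocab = {tuple(word) + ("</w>",): f for word, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


# Toy corpus (illustrative only, not data from the paper).
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(word_freqs, num_merges=3)
print(merges)
```

The number of merges controls the vocabulary size: few merges leave text close to character level, while many merges approach word-level tokenization, which is the trade-off the word versus subword comparison explores.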

