This paper presents the creation of initial bilingual corpora for thirteen very low-resource languages of India, all from Northeast India. These are the first-ever parallel corpora for these languages, and we report initial benchmark neural machine translation results on them. We intend to extend these corpora to include a large number of low-resource Indian languages and to integrate the effort with our prior work on African and American-Indian languages, creating corpora that cover a large number of languages from across the world.