Modern machine translation systems rely heavily on large, high-quality parallel datasets to achieve state-of-the-art performance. However, such resources are largely unavailable for most South Asian languages. Nepali and Tamang fall into this category, with Tamang being among the least digitally resourced languages in the region. This work addresses the gap by developing NepTam20K, a 20K gold-standard parallel corpus, and NepTam80K, an 80K synthetic Nepali-Tamang parallel corpus, both sentence-aligned and designed to support machine translation. The datasets were created through a pipeline of data scraping from Nepali news and online sources, pre-processing, semantic filtering, balancing for tense and polarity (in NepTam20K only), expert translation into Tamang by native speakers, and verification by an expert Tamang linguist. The dataset covers five domains: Agriculture, Health, Education and Technology, Culture, and General Communication. To evaluate the dataset, baseline machine translation experiments were carried out using three multilingual pre-trained models (mBART, M2M-100, and NLLB-200) and a vanilla Transformer model. Fine-tuning NLLB-200 achieved the highest sacreBLEU scores: 40.92 (Nepali-Tamang) and 45.26 (Tamang-Nepali).
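As a rough illustration of the pre-processing stage of such a pipeline, the sketch below cleans a list of sentence-aligned pairs by normalizing text, dropping pairs with implausible lengths, and removing exact duplicates. It is not the authors' implementation: the function names `normalize` and `clean_pairs`, the token limits, and the length-ratio threshold are all assumptions chosen for illustration, and the surface heuristics shown here are a much simpler stand-in for the semantic filtering the abstract mentions.

```python
import unicodedata


def normalize(sentence: str) -> str:
    """NFC-normalize the text and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", sentence).split())


def clean_pairs(pairs, min_tokens=3, max_tokens=80, max_ratio=2.5):
    """Filter sentence-aligned (source, target) pairs.

    All thresholds are hypothetical defaults, not values from the paper.
    """
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = normalize(src), normalize(tgt)
        n_src, n_tgt = len(src.split()), len(tgt.split())

        # Drop pairs where either side is too short or too long.
        if not (min_tokens <= n_src <= max_tokens):
            continue
        if not (min_tokens <= n_tgt <= max_tokens):
            continue

        # Drop pairs whose lengths differ too much to be plausible translations.
        if max(n_src, n_tgt) / max(1, min(n_src, n_tgt)) > max_ratio:
            continue

        # Drop exact duplicates (after normalization), keeping the first one.
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept
```

Whitespace tokenization is used only because it needs no external dependencies; a real pipeline would likely count subword tokens and tune the thresholds per language pair.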