While machine translation is regarded as a "solved problem" for many high-resource languages, close analysis quickly reveals that this is not the case for content that poses challenges such as poetic language, philosophical concepts, and multi-layered metaphorical expressions. Sanskrit literature is a prime example: it combines a large number of such challenges with inherent linguistic features like sandhi, compounding, and heavy morphology, which further complicate downstream NLP tasks. It spans multiple millennia of text production as well as a large breadth of domains, ranging from ritual formulas and epic narratives to philosophical treatises, poetic verses, and scientific material. As of now, publicly available resources covering these different domains and temporal layers of Sanskrit are severely lacking. We therefore introduce Mitrasamgraha, a high-quality Sanskrit-to-English machine translation dataset consisting of 391,548 bitext pairs, more than four times larger than Itihāsa, the largest previously available Sanskrit dataset. It covers a time period of more than three millennia and a broad range of historical Sanskrit domains. In contrast to web-crawled datasets, the temporal and domain annotation of this dataset enables fine-grained study of domain and time-period effects on MT performance. We also release a validation set of 5,587 and a test set of 5,552 post-corrected bitext pairs. We benchmark commercial and open models on this dataset and fine-tune NLLB and Gemma models on it, showing significant improvements while recognizing persistent challenges in the translation of complex compounds, philosophical concepts, and multi-layered metaphors. We also analyze how in-context learning on this dataset affects the performance of commercial models.