We release S\={a}mayik, a dataset of around 53,000 parallel English-Sanskrit sentences written in contemporary prose. Sanskrit is a classical language that is still in use and has a rich documented heritage. However, because little of its content has been digitized, it remains a low-resource language. Existing Sanskrit corpora, whether monolingual or bilingual, have predominantly focused on poetry and offer limited coverage of contemporary written material. S\={a}mayik is curated from a diverse range of domains, including language instruction material, textual teaching pedagogy, and online tutorials, among others. It stands out as a resource that specifically caters to the contemporary usage of Sanskrit, with a primary emphasis on prose writing. Translation models trained on our dataset demonstrate statistically significant improvements when translating out-of-domain contemporary corpora, outperforming models trained on older classical-era poetry datasets. Finally, we also release benchmark models for translating between English and Sanskrit by adapting four multilingual pre-trained models: three of them have not previously been exposed to Sanskrit, while the fourth is a multilingual pre-trained translation model that includes English and Sanskrit. The dataset and source code are available at https://github.com/ayushbits/saamayik.