Etruscan is an ancient language that was spoken in Italy from the 7th century BC to the 1st century AD. It has no native speakers today, and its resources are scarce: only around 12,000 inscriptions are known. To the best of our knowledge, there are no publicly available Etruscan corpora for natural language processing. We therefore propose a dataset for machine translation from Etruscan to English, containing 2891 translated examples drawn from existing academic sources. Some examples were extracted manually, while others were acquired automatically. Along with the dataset, we benchmark several machine translation models and observe that a small transformer model can achieve a BLEU score of 10.1. Releasing the dataset can help enable future research on this language, on similar languages, or on other low-resource languages.