The five idioms (i.e., varieties) of the Romansh language are largely standardized and are taught in the schools of the respective communities in Switzerland. In this paper, we present the first parallel corpus of Romansh idioms. The corpus is based on 291 schoolbook volumes, which are comparable in content for the five idioms. We use automatic alignment methods to extract 207k multi-parallel segments from the books, with more than 2M tokens in total. A small-scale human evaluation confirms that the segments are highly parallel, making the dataset suitable for NLP applications such as machine translation between Romansh idioms. We release the parallel and unaligned versions of the dataset under a CC-BY-NC-SA license and demonstrate its utility for machine translation by training and evaluating an LLM and a supervised multilingual MT model on the dataset.