SiDiaC-v.2.0 is the largest comprehensive Sinhala Diachronic Corpus to date, covering publication dates from 1800 CE to 1955 CE and written dates from the 5th to the 20th century CE. The corpus consists of 244k words across 185 literary works that underwent thorough filtering, preprocessing, and copyright compliance checks, followed by extensive post-processing. Additionally, a subset of 59 documents totalling 70k words was annotated based on their written dates. Texts from the National Library of Sri Lanka were selected from the SiDiaC-v.1.0 non-filtered list and digitised using Google Document AI OCR. The OCR output was then post-processed to correct formatting issues, address code-mixing, include special tokens, and fix malformed tokens. The construction of SiDiaC-v.2.0 was informed by practices from other corpora, such as FarPaHC, SiDiaC-v.1.0, and CCOHA. These practices were particularly relevant for syntactic annotation and text normalisation strategies, given that Faroese shares Sinhala's low-resource language status and that CCOHA utilised similar cleaning strategies. The corpus is categorised by genre at two levels: primary and secondary. The primary categorisation is binary, assigning each book to either Non-Fiction or Fiction. The secondary categorisation is more detailed, grouping texts under specific genres such as Religious, History, Poetry, Language, and Medical. Despite challenges arising from limited resources, SiDiaC-v.2.0 serves as a comprehensive resource for Sinhala NLP, building upon the work previously done in SiDiaC-v.1.0.