SiDiaC, the first comprehensive Sinhala Diachronic Corpus, covers a historical span from the 5th to the 20th century CE. It comprises 58k words across 46 literary works, each annotated with its date of writing; the works were selected after filtering for availability, authorship, copyright compliance, and data attribution. Texts from the National Library of Sri Lanka were digitised using Google Document AI OCR and then post-processed to correct formatting and modernise the orthography. The construction of SiDiaC was informed by practices from other corpora, such as FarPaHC, particularly in syntactic annotation and text normalisation strategies, because those corpora likewise target low-resource languages. The corpus is categorised by genre in two layers: primary and secondary. The primary categorisation is binary, classifying each book as Non-Fiction or Fiction, while the secondary categorisation is more specific, grouping texts under the Religious, History, Poetry, Language, and Medical genres. Despite challenges including limited access to rare texts and reliance on secondary sources for dating, SiDiaC serves as a foundational resource for Sinhala NLP. It significantly extends the resources available for Sinhala and enables diachronic studies of lexical change, neologism tracking, historical syntax, and corpus-based lexicography.
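The two-layer genre scheme described above could be represented roughly as follows. This is a minimal illustrative sketch, not the corpus's actual schema or file format: the names `CorpusEntry`, `PrimaryGenre`, `SecondaryGenre`, and `group_by_genre`, the choice of a century-level date field, and the placeholder titles are all assumptions introduced here. Only the genre labels and the 5th to 20th century range come from the description.

```python
from dataclasses import dataclass
from enum import Enum


class PrimaryGenre(Enum):
    """Primary (binary) layer named in the abstract."""
    NON_FICTION = "Non-Fiction"
    FICTION = "Fiction"


class SecondaryGenre(Enum):
    """Secondary (more specific) layer named in the abstract."""
    RELIGIOUS = "Religious"
    HISTORY = "History"
    POETRY = "Poetry"
    LANGUAGE = "Language"
    MEDICAL = "Medical"


@dataclass(frozen=True)
class CorpusEntry:
    """Hypothetical record for one work; the field names are assumptions."""
    title: str
    century: int  # century CE in which the work was written
    primary: PrimaryGenre
    secondary: SecondaryGenre

    def __post_init__(self):
        # The corpus is described as spanning the 5th to 20th century CE.
        if not 5 <= self.century <= 20:
            raise ValueError(f"century {self.century} is outside 5..20")


def group_by_genre(entries):
    """Group entries by their (primary, secondary) genre pair."""
    groups = {}
    for entry in entries:
        groups.setdefault((entry.primary, entry.secondary), []).append(entry)
    return groups


# Placeholder entries for illustration only; these are not real corpus records.
sample = [
    CorpusEntry("Placeholder work A", 13,
                PrimaryGenre.NON_FICTION, SecondaryGenre.RELIGIOUS),
    CorpusEntry("Placeholder work B", 15,
                PrimaryGenre.NON_FICTION, SecondaryGenre.RELIGIOUS),
    CorpusEntry("Placeholder work C", 18,
                PrimaryGenre.FICTION, SecondaryGenre.POETRY),
]
groups = group_by_genre(sample)
```

Keeping the two layers as separate fields, rather than deriving the primary genre from the secondary one, avoids assuming any fixed mapping between them, since the description does not state one.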