Sinhala is the native language of the Sinhalese people, who make up the largest ethnic group in Sri Lanka. The language belongs to the globe-spanning Indo-European language family. However, owing to a shortage of both linguistic and economic capital, Sinhala remains a resource-poor language from the perspective of Natural Language Processing tools and research: it has neither the economic drive of its cousin English nor the sheer weight of numbers of a language such as Chinese. A number of research groups in Sri Lanka have noticed this dearth and the resulting dire need for proper tools and research for Sinhala natural language processing. However, for various reasons, these efforts seem to lack coordination and awareness of each other. The objective of this paper is to fill that gap with a comprehensive literature survey of the publicly available Sinhala natural language processing tools and research, so that researchers working in this field can better utilize the contributions of their peers. To that end, we shall upload this paper to arXiv and update it periodically to reflect the advances made in the field.