Serbian is a Slavic language spoken by over 12 million people and well understood by over 15 million. From the perspective of natural language processing (NLP), it is a low-resource language; it is also highly inflected. The combination of rich inflection and scarce language resources makes NLP for Serbian challenging. Nevertheless, over the past three decades there have been a number of initiatives to develop resources and methods for Serbian NLP, ranging from corpora of free text drawn from books and the internet, through annotated corpora for classification and named entity recognition, to various methods and models that perform these tasks. In this paper, we review these initiatives, resources, and methods, along with their availability.