We present CLASSLA-Stanza, a pipeline for automatic linguistic annotation of the South Slavic languages, based on the Stanza natural language processing pipeline. We describe the main improvements of CLASSLA-Stanza over Stanza and give a detailed account of the model training process for the latest release of the pipeline, version 2.1. We also report the pipeline's performance scores for different languages and language varieties. CLASSLA-Stanza achieves consistently high performance across all supported languages and outperforms or extends its parent pipeline Stanza on all supported tasks. Finally, we present the pipeline's new functionality for efficient processing of web data and the reasons that led to its implementation.