We present CLASSLA-Stanza, a pipeline for automatic linguistic annotation of the South Slavic languages, based on the Stanza natural language processing pipeline. We describe the main improvements of CLASSLA-Stanza over Stanza and give a detailed description of the model training process for the latest release of the pipeline, version 2.1. We also report the pipeline's performance scores for the different languages and varieties. CLASSLA-Stanza performs consistently well across all supported languages, and on every supported task it either outperforms its parent pipeline Stanza or extends it. Finally, we present the pipeline's new functionality for efficient processing of web data and the reasons that led to its implementation.