This paper presents a set of industrial-grade text processing models for Hungarian that achieve near state-of-the-art performance while balancing resource efficiency and accuracy. The models are implemented in the spaCy framework and extend the HuSpaCy toolkit with several improvements to its architecture. In contrast to existing NLP tools for Hungarian, every one of our pipelines covers all basic text processing steps, namely tokenization, sentence-boundary detection, part-of-speech tagging, morphological feature tagging, lemmatization, dependency parsing, and named entity recognition, with high accuracy and throughput. We thoroughly evaluate the proposed enhancements, compare the pipelines with state-of-the-art tools, and demonstrate that the new models perform competitively in all text preprocessing steps. All experiments are reproducible, and the pipelines are freely available under a permissive license.