Natural language processing for the Turkic language family, spoken by over 200 million people across Eurasia, remains fragmented: most of these languages lack unified tooling and resources. We present TurkicNLP, an open-source Python library that provides a single, consistent NLP pipeline for Turkic languages across four script families: Latin, Cyrillic, Perso-Arabic, and Old Turkic Runic. Through one language-agnostic API, the library covers tokenization, morphological analysis, part-of-speech tagging, dependency parsing, named entity recognition, bidirectional script transliteration, cross-lingual sentence embeddings, and machine translation. A modular multi-backend architecture transparently integrates rule-based finite-state transducers and neural models, with automatic script detection and routing between script variants. Outputs follow the CoNLL-U format for interoperability with existing tools and ease of extension. Code and documentation are hosted at https://github.com/turkic-nlp/turkicnlp.
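To make the idea of automatic script detection concrete, the following is a minimal sketch of how text might be assigned to one of the four script families by majority vote over Unicode character names. It is an illustration under stated assumptions, not TurkicNLP's actual implementation or API: the function name `detect_script`, the label strings, and the majority-vote heuristic are all hypothetical.

```python
import unicodedata
from collections import Counter

# Hypothetical labels for the four script families named in the abstract.
# The keys are prefixes of official Unicode character names.
SCRIPT_PREFIXES = {
    "LATIN": "Latin",
    "CYRILLIC": "Cyrillic",
    "ARABIC": "Perso-Arabic",
    "OLD TURKIC": "Old Turkic",
}


def detect_script(text: str) -> str:
    """Return the most frequent script family among the letters of `text`.

    Assumption: a simple majority vote is good enough for illustration;
    a real router would also need to handle mixed-script input.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, and whitespace
        name = unicodedata.name(ch, "")
        for prefix, label in SCRIPT_PREFIXES.items():
            if name.startswith(prefix):
                counts[label] += 1
                break
    if not counts:
        return "Unknown"
    return counts.most_common(1)[0][0]
```

A router built on such a function could then dispatch each input to the backend registered for the detected script, for example sending `detect_script("қазақ тілі")` results labeled Cyrillic to a Cyrillic-script Kazakh model.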