Recent advances in natural language processing have predominantly favored well-resourced, English-centric models, leaving a significant gap for low-resource languages. In this work, we introduce TURNA, a language model developed for the low-resource language Turkish that is capable of both natural language understanding and generation. TURNA uses an encoder-decoder architecture and is pretrained under the unified framework UL2 on a diverse corpus that we curated specifically for this purpose. We evaluate TURNA on three generation tasks and five understanding tasks in Turkish. The results show that TURNA outperforms several multilingual models on both understanding and generation tasks, and is competitive with monolingual Turkish models on understanding tasks. TURNA is available at https://huggingface.co/boun-tabi-LMG/TURNA.