Research in NLP for the Central Asian Turkic languages (Kazakh, Uzbek, Kyrgyz, and Turkmen) faces the typical challenges of low-resource languages: data scarcity, limited linguistic resources, and underdeveloped language technology. Recent work has nonetheless produced language-specific datasets and models for downstream tasks. This paper summarizes that progress and identifies directions for future research. It provides a high-level overview of each language's linguistic features, the current technology landscape, the application of transfer learning from higher-resource languages, and the availability of labeled and unlabeled data. By outlining the current state of the field, we hope to inspire and facilitate future research.