As in other languages, the number of open-source language models that can generate Turkish is growing steadily. Base versions of such models are usually created by continuing the training of multilingual models on Turkish corpora. The alternative is to train a model on Turkish corpora only. In this study, we first introduce the cosmosGPT models, which we built with this alternative method. We then introduce new fine-tuning datasets that enable base language models to fulfill user requests, and new evaluation datasets for measuring the capabilities of Turkish language models. Finally, we present a comprehensive comparison of the adapted Turkish language models across different capabilities. The results show that the language models we built from a monolingual corpus perform promisingly despite being about 10 times smaller than the others.