We present Kakugo, a novel, cost-effective pipeline for training general-purpose Small Language Models (SLMs) for low-resource languages using only the language name as input. Using a large teacher model to generate synthetic prompts and translate existing instruction datasets, we produce training data for 54 low-resource languages and train an SLM for each. Evaluations across a diverse set of general natural language processing tasks, including translation, classification, and question answering, demonstrate that our pipeline consistently improves performance over the base models. With a total generation and training cost of under $50 per language, Kakugo offers an accessible method for communities to develop language-specific AI.
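Below is a minimal sketch of the data-generation step the abstract describes: prompting a large teacher model, given only the target language's name, to produce instruction/response pairs in that language. The OpenAI-compatible client, the placeholder model name, the prompt wording, and the generate_synthetic_pairs helper are illustrative assumptions, not the authors' exact setup.

```python
# Sketch of the teacher-driven data-generation step (assumptions noted above).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_synthetic_pairs(language: str, n_prompts: int = 5) -> list[dict]:
    """Ask a large teacher model to produce instruction/response pairs
    in the target low-resource language, given only its name."""
    pairs = []
    for _ in range(n_prompts):
        resp = client.chat.completions.create(
            model="gpt-4o",  # placeholder teacher model, not the paper's choice
            messages=[
                {"role": "system",
                 "content": f"You are a fluent speaker of {language}."},
                {"role": "user",
                 "content": f"Write one instruction in {language} about an everyday "
                            f"topic, then answer it in {language}. "
                            "Separate instruction and answer with '###'."},
            ],
        )
        text = resp.choices[0].message.content
        if text and "###" in text:
            instruction, answer = text.split("###", 1)
            pairs.append({"instruction": instruction.strip(),
                          "response": answer.strip()})
    return pairs

# Example: build a small seed set for one of the 54 target languages.
# seed = generate_synthetic_pairs("Wolof", n_prompts=100)
```

The resulting instruction/response pairs would then serve as fine-tuning data for the language-specific SLM; the choice of teacher model and prompt template is where the per-language cost budget is spent.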