We introduce GECKO, a bilingual large language model (LLM) optimized for Korean and English, as well as programming languages. GECKO adopts the LLaMA architecture and is pretrained on a balanced, high-quality corpus of Korean and English. In this report, we share our experience building a better data pipeline for the corpus and training the model. GECKO generates tokens efficiently in both Korean and English despite its small vocabulary size. We evaluate it on representative Korean, English, and code benchmarks: it performs strongly on KMMLU (Korean MMLU) and modestly on English and code, even though it was trained on fewer tokens than English-focused LLMs. GECKO is available to the open-source community under a permissive license. We hope our work offers a research baseline and practical insights for Korean LLM research. The model can be found at: https://huggingface.co/kifai/GECKO-7B