This report introduces \texttt{EEVE-Korean-v1.0}, a Korean adaptation of large language models that exhibits remarkable capabilities in both English and Korean text understanding. Recent highly capable LLMs such as SOLAR-10.7B and Phi-2 are English-centric, and their tokenizers process non-English text inefficiently. Building on these models, we present an efficient and effective vocabulary expansion (EEVE) method, which encompasses parameter freezing and subword initialization. In contrast to previous efforts that assume new embeddings require trillions of training tokens, we show that our method can significantly boost non-English proficiency within just 2 billion tokens. As of January 2024, our model \texttt{EEVE-Korean-10.8B-v1.0} surpasses most instruction-tuned LLMs on the Open Ko-LLM Leaderboard and ranks as the leading Korean pre-trained model in the open-source community, according to Hugging Face's leaderboard. We open-source our models on Hugging Face to empower the open research community in various languages.