The field of Natural Language Processing (NLP) has seen significant advancements with the development of Large Language Models (LLMs). However, much of this research remains focused on English, often overlooking low-resource languages such as Korean. This oversight poses challenges stemming from Korean's unique non-alphabetic token structure and from the substantial memory and computational demands of LLM training, which frequently lead to out-of-memory errors. To address these issues, we present RedWhale, a model specifically tailored for Korean language processing. RedWhale is developed through an efficient continual pretraining approach comprising a comprehensive Korean corpus preprocessing pipeline, a specialized tokenizer, an optimized model initialization technique, and a multistage pretraining strategy. Together, these innovations reduce training time and computational cost while maintaining high accuracy and comprehension. By leveraging cross-lingual transfer learning, which exploits shared linguistic similarities across languages, RedWhale builds on English models to enhance Korean language processing. Experimental results show that RedWhale outperforms other leading models on Korean NLP benchmarks, including the Korean Balanced Evaluation of Significant Tasks (KoBEST), demonstrating superior understanding and generation of Korean text. Furthermore, RedWhale showed no signs of convergence even after pretraining on 9.7 billion tokens, indicating the potential for further gains with additional training. This work represents a significant step toward bridging the linguistic divide, particularly in advancing NLP capabilities for the Korean language.