We introduce Solar Open, a 102B-parameter bilingual English-Korean Mixture-of-Experts language model built for underserved languages. Solar Open demonstrates a systematic methodology for building competitive LLMs by addressing three interconnected challenges. First, to train effectively despite the data scarcity of underserved languages, we synthesize 4.5T tokens of high-quality, domain-specific, and RL-oriented data. Second, we coordinate this data through a progressive curriculum that jointly optimizes mixture composition, quality thresholds, and domain coverage across 20 trillion training tokens. Third, to enable reasoning capabilities through scalable RL, we apply SnapPO, our proposed framework for efficient policy optimization. Across benchmarks in both English and Korean, Solar Open achieves competitive performance, demonstrating the effectiveness of this methodology for underserved-language AI development.
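To make the curriculum idea concrete, the following is a minimal sketch of how a staged data schedule over a 20T-token budget might be expressed, with composition shifting and the quality bar tightening across stages. All stage names, mixture weights, thresholds, and token splits below are hypothetical illustrations, not the paper's actual configuration.

```python
# Illustrative sketch (not from the paper): a progressive data curriculum
# that shifts mixture composition and raises quality thresholds as training
# advances. Stage names, weights, and thresholds are all hypothetical.
from dataclasses import dataclass

@dataclass
class CurriculumStage:
    name: str
    tokens: float               # training tokens allotted to this stage, in trillions
    quality_threshold: float    # minimum quality score to admit a document
    mixture: dict[str, float]   # domain -> sampling weight (sums to 1.0)

# Hypothetical three-stage schedule summing to 20T tokens: early training
# favors broad web data under a loose quality bar; later stages tighten the
# bar and up-weight synthetic and underserved-language (Korean) data.
STAGES = [
    CurriculumStage("broad",  12.0, 0.3, {"web_en": 0.55, "web_ko": 0.15, "code": 0.15, "synthetic": 0.15}),
    CurriculumStage("refine",  6.0, 0.6, {"web_en": 0.35, "web_ko": 0.25, "code": 0.15, "synthetic": 0.25}),
    CurriculumStage("anneal",  2.0, 0.8, {"web_en": 0.20, "web_ko": 0.30, "code": 0.15, "synthetic": 0.35}),
]

def stage_for_token(tokens_seen: float) -> CurriculumStage:
    """Return the curriculum stage active after `tokens_seen` trillion tokens."""
    consumed = 0.0
    for stage in STAGES:
        consumed += stage.tokens
        if tokens_seen < consumed:
            return stage
    return STAGES[-1]

if __name__ == "__main__":
    for t in (1.0, 14.0, 19.5):
        s = stage_for_token(t)
        print(f"{t:>5.1f}T tokens -> stage '{s.name}', "
              f"quality >= {s.quality_threshold}, mix {s.mixture}")
```

In this framing, "jointly optimizing" composition, quality thresholds, and domain coverage amounts to choosing the per-stage mixtures and thresholds together rather than independently, so that later, higher-quality stages compensate for domains under-represented early on.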