Gamayun's Path to Multilingual Mastery: Cost-Efficient Training of a 1.5B-Parameter LLM

Alexander Podolskiy, Semen Molokov, Timofey Gerasin, Maksim Titov, Alexey Rukhovich, Artem Khrapov, Kirill Morozov, Evgeny Tetin, Constantine Korikov, Pavel Efimov, Polina Lazukova, Yuliya Skripkar, Nikita Okhotnikov, Irina Piontkovskaya, Meng Xiaojun, Zou Xueyi, Zhang Zhenhe

We present Gamayun, a 1.5B-parameter multilingual language model trained entirely from scratch on 2.5T tokens. Designed for efficiency and deployment in resource-constrained environments, Gamayun addresses the lack of research on small non-English-centric LLMs by adopting a novel two-stage pre-training strategy: balanced multilingual training for cross-lingual alignment, followed by high-quality English enrichment to transfer performance gains across languages. Our model supports 12 languages, with a special focus on Russian. Despite a significantly smaller training budget than comparable models, Gamayun outperforms LLaMA3.2-1B (9T tokens) on all considered benchmarks and surpasses Qwen2.5-1.5B (18T tokens) on a wide range of English and multilingual tasks. It matches or exceeds Qwen3 (36T tokens) on most tasks outside advanced STEM and achieves state-of-the-art results in Russian, including on the MERA benchmark, among models of comparable size (1-2B parameters).
