We present Gamayun, a 1.5B-parameter multilingual language model trained entirely from scratch on 2.5T tokens. Designed for efficiency and deployment in resource-constrained environments, Gamayun addresses the lack of research on small non-English-centric LLMs by adopting a novel two-stage pre-training strategy: balanced multilingual training for cross-lingual alignment, followed by high-quality English enrichment to transfer performance gains across languages. Our model supports 12 languages, with a special focus on Russian. Despite a significantly smaller training budget than comparable models, Gamayun outperforms LLaMA3.2-1B (9T tokens) on all considered benchmarks and surpasses Qwen2.5-1.5B (18T tokens) on a wide range of English and multilingual tasks. It matches or exceeds Qwen3 (36T tokens) on most tasks outside advanced STEM, achieving state-of-the-art results in Russian, including on the MERA benchmark, among models of comparable size (1-2B parameters).