Gamayun's Path to Multilingual Mastery: Cost-Efficient Training of a 1.5B-Parameter LLM

December 25, 2025

Alexander Podolskiy, Semen Molokov, Timofey Gerasin, Maksim Titov, Alexey Rukhovich, Artem Khrapov, Kirill Morozov, Evgeny Tetin, Constantine Korikov, Pavel Efimov, Polina Lazukova, Yuliya Skripkar, Nikita Okhotnikov, Irina Piontkovskaya, Meng Xiaojun, Zou Xueyi, Zhang Zhenhe

We present Gamayun, a 1.5B-parameter multilingual language model trained entirely from scratch on 2.5T tokens. Designed for efficiency and deployment in resource-constrained environments, Gamayun addresses the lack of research on small non-English-centric LLMs by adopting a novel two-stage pre-training strategy: balanced multilingual training for cross-lingual alignment, followed by high-quality English enrichment to transfer performance gains across languages. Our model supports 12 languages, with special focus on Russian. Despite a significantly smaller training budget than comparable models, Gamayun outperforms LLaMA3.2-1B (9T tokens) on all considered benchmarks, and surpasses Qwen2.5-1.5B (18T tokens) on a wide range of English and multilingual tasks. It matches or exceeds Qwen3 (36T tokens) on most tasks outside advanced STEM, achieving state-of-the-art results in Russian, including the MERA benchmark, among the models of comparable size (1-2B parameters).
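To put the "significantly smaller training budget" in concrete terms, the following minimal sketch computes how many times larger each baseline's token budget is than Gamayun's. The token counts are taken from the abstract; the dictionary and function names are illustrative only, and the comparison covers token counts alone, not compute, data quality, or model size.

```python
# Token budgets in trillions of tokens, as quoted in the abstract.
TOKEN_BUDGETS_T = {
    "Gamayun": 2.5,
    "LLaMA3.2-1B": 9.0,
    "Qwen2.5-1.5B": 18.0,
    "Qwen3": 36.0,
}


def budget_ratios(budgets, reference="Gamayun"):
    """Return each model's token budget as a multiple of the reference model's."""
    ref_tokens = budgets[reference]
    return {
        name: tokens / ref_tokens
        for name, tokens in budgets.items()
        if name != reference
    }


ratios = budget_ratios(TOKEN_BUDGETS_T)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x Gamayun's token budget")
```

By these figures, the baselines were trained on roughly 3.6 to 14.4 times as many tokens as Gamayun.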
