Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent

Xingwu Sun, Yanfeng Chen, Yiqing Huang, Ruobing Xie, Jiaqi Zhu, Kai Zhang, Shuaipeng Li, Zhen Yang, Jonny Han, Xiaobo Shu, Jiahao Bu, Zhongzhi Chen, Xuemeng Huang, Fengzong Lian, Saiyong Yang, Jianfeng Yan, Yuyuan Zeng, Xiaoqin Ren, Chao Yu, Lulu Wu, Yue Mao, Jun Xia, Tao Yang, Suncong Zheng, Kan Wu, Dian Jiao, Jinbao Xue, Xipeng Zhang, Decheng Wu, Kai Liu, Dengpeng Wu, Guanghui Xu, Shaohua Chen, Shuang Chen, Xiao Feng, Yigeng Hong, Junqiang Zheng, Chengcheng Xu, Zongwei Li, Xiong Kuang, Jianglu Hu, Yiqi Chen, Yuchi Deng, Guiyang Li, Ao Liu, Chenchen Zhang, Shihui Hu, Zilong Zhao, Zifan Wu, Yao Ding, Weichao Wang, Han Liu, Roberts Wang, Hao Fei, Peijie Yu, Ze Zhao, Xun Cao, Hai Wang, Fusheng Xiang, Mengyuan Huang, Zhiyuan Xiong, Bin Hu, Xuebin Hou, Lei Jiang, Jianqiang Ma, Jiajia Wu, Yaping Deng, Yi Shen, Qian Wang, Weijie Liu, Jie Liu, Meng Chen, Liang Dong, Weiwen Jia, Hu Chen, Feifei Liu, Rui Yuan, Huilin Xu, Zhenxiang Yan, Tengfei Cao, Zhichao Hu, Xinhua Feng, Dong Du, Tinghao Yu, Yangyu Tao, Feng Zhang, Jianchen Zhu, Chengzhong Xu, Xirui Li, Chong Zha, Wen Ouyang, Yinben Xia, Xiang Li, Zekun He, Rongpeng Chen, Jiawei Song, Ruibin Chen, Fan Jiang, Chongqing Zhao, Bo Wang, Hao Gong, Rong Gan, Winston Hu, Zhanhui Kang, Yong Yang, Yuhong Liu, Di Wang, Jie Jiang

From arXiv; 17 pages, 4 figures

In this paper, we introduce Hunyuan-Large, currently the largest open-source Transformer-based mixture-of-experts (MoE) model, with 389 billion total parameters and 52 billion activated parameters, capable of handling up to 256K tokens. We thoroughly evaluate Hunyuan-Large across benchmarks covering language understanding and generation, logical reasoning, mathematical problem-solving, coding, long-context, and aggregated tasks, where it outperforms LLama3.1-70B and performs comparably to the significantly larger LLama3.1-405B model. Key practices of Hunyuan-Large include large-scale synthetic data that is orders of magnitude larger than in previous literature, a mixed expert routing strategy, a key-value cache compression technique, and an expert-specific learning rate strategy. We also investigate the scaling laws and learning rate schedule of mixture-of-experts models, providing insights and guidance for future model development and optimization. The code and checkpoints of Hunyuan-Large are released to facilitate future innovations and applications.

Code: https://github.com/Tencent/Hunyuan-Large
Models: https://huggingface.co/tencent/Tencent-Hunyuan-Large
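The abstract gives two headline numbers (389 billion total and 52 billion activated parameters) and names a mixed expert routing strategy without describing it. The sketch below is only an illustration of the general idea, under the assumption that "mixed" means one shared expert that always runs plus a small number of routed experts picked per token by a softmax router. The function names (`mixed_route`, `moe_layer`, `make_expert`), the scalar toy experts, and the `top_k=1` default are all hypothetical and are not taken from the Hunyuan-Large code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_route(router_logits, top_k=1):
    """Pick the top_k routed experts for one token (assumed scheme).

    Returns (indices, gate_weights). The shared expert is not routed:
    it is applied to every token in moe_layer below.
    """
    probs = softmax(router_logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = order[:top_k]
    return chosen, [probs[i] for i in chosen]

def moe_layer(x, shared_expert, routed_experts, router_logits, top_k=1):
    """Shared expert output plus gate-weighted outputs of the chosen experts."""
    chosen, gates = mixed_route(router_logits, top_k)
    out = shared_expert(x)
    for idx, gate in zip(chosen, gates):
        out += gate * routed_experts[idx](x)
    return out

def make_expert(scale):
    """Toy scalar 'expert' standing in for a feed-forward block."""
    return lambda x: scale * x

# Parameter counts from the abstract, in billions.
TOTAL_PARAMS_B = 389
ACTIVATED_PARAMS_B = 52
activated_fraction = ACTIVATED_PARAMS_B / TOTAL_PARAMS_B

if __name__ == "__main__":
    print(f"activated fraction: {activated_fraction:.1%}")
    experts = [make_expert(s) for s in (10.0, 100.0, 1000.0)]
    y = moe_layer(2.0, make_expert(1.0), experts, [0.0, 2.0, 1.0])
    print("toy layer output:", y)
```

With the abstract's figures, roughly 13% of the parameters are used for any one token, which is the reason a sparse MoE of this size can cost about as much per token as a much smaller dense model. The toy layer only shows where the shared and routed paths combine; a real layer would operate on tensors and batch tokens per expert.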

