MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series

Ge Zhang, Scott Qu, Jiaheng Liu, Chenchen Zhang, Chenghua Lin, Chou Leuang Yu, Danny Pan, Esther Cheng, Jie Liu, Qunshu Lin, Raven Yuan, Tuney Zheng, Wei Pang, Xinrun Du, Yiming Liang, Yinghao Ma, Yizhi Li, Ziyang Ma, Bill Lin, Emmanouil Benetos, Huan Yang, Junting Zhou, Kaijing Ma, Minghao Liu, Morry Niu, Noah Wang, Quehry Que, Ruibo Liu, Sine Liu, Shawn Guo, Soren Gao, Wangchunshu Zhou, Xinyue Zhang, Yizhi Zhou, Yubo Wang, Yuelin Bai, Yuhan Zhang, Yuxiang Zhang, Zenith Wang, Zhenzhu Yang, Zijian Zhao, Jiajun Zhang, Wanli Ouyang, Wenhao Huang, Wenhu Chen

From arXiv, https://map-neo.github.io/

Large Language Models (LLMs) have made great strides in recent years, achieving unprecedented performance across a wide range of tasks. However, due to commercial interests, the most competitive models, such as GPT, Gemini, and Claude, are gated behind proprietary interfaces, and their training details are not disclosed. Recently, many institutions have open-sourced strong LLMs such as LLaMA-3 that are comparable to existing closed-source LLMs. However, only the model weights are released, while most details (e.g., intermediate checkpoints, pre-training corpus, and training code) remain undisclosed. To improve the transparency of LLMs, the research community has begun to release truly open LLMs (e.g., Pythia, Amber, OLMo), which provide more details (e.g., the pre-training corpus and training code). These models have greatly advanced the scientific study of large models, including their strengths, weaknesses, biases, and risks. However, we observe that existing truly open LLMs still lag behind state-of-the-art LLMs of similar size on reasoning, knowledge, and coding tasks. To this end, we open-source MAP-Neo, a highly capable and transparent bilingual language model with 7B parameters trained from scratch on 4.5T high-quality tokens. MAP-Neo is the first fully open-sourced bilingual LLM whose performance is comparable to that of existing state-of-the-art LLMs. Moreover, we open-source all the details needed to reproduce MAP-Neo, including the cleaned pre-training corpus, data cleaning pipeline, checkpoints, and a well-optimized training/evaluation framework. Finally, we hope that MAP-Neo will strengthen the open research community and inspire further innovation and creativity, facilitating continued improvement of LLMs.

