CodeShell Technical Report

Code large language models mark a pivotal breakthrough in artificial intelligence. They are designed specifically to understand and generate programming languages, and they significantly improve the efficiency of software development workflows. In this technical report, we present CodeShell-Base, a seven-billion-parameter foundation model with an 8K context length that shows strong proficiency in code comprehension. By incorporating Grouped-Query Attention and Rotary Positional Embedding into GPT-2, CodeShell-Base combines the structural merits of StarCoder and CodeLlama in its own architectural design. We then built a comprehensive data pre-processing pipeline that includes similar-data deduplication, perplexity-based data filtering, and model-based data filtering. Through this pipeline, we curated 100 billion tokens of high-quality pre-training data from GitHub. Benefiting from this high-quality data, CodeShell-Base outperforms CodeLlama on HumanEval after training on just 500 billion tokens (5 epochs). We conducted extensive experiments on datasets in multiple languages, including Python, Java, and C++, and the results indicate that our model has robust foundational capabilities in code comprehension and generation.
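The abstract names the stages of the data pipeline but not how they are implemented. As a rough illustration only, the first two stages might look like the sketch below. It assumes Jaccard similarity over token shingles for similar-data deduplication and uses a small unigram model as a stand-in for whatever language model scores perplexity in the actual pipeline. All function names, the shingle size, and the thresholds are hypothetical, and the model-based filtering stage is not sketched.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split source text into word and punctuation tokens (whitespace is ignored)."""
    return re.findall(r"\w+|[^\w\s]", text)


def shingles(text, k=3):
    """Return the set of k-token shingles of a document."""
    tokens = tokenize(text)
    if len(tokens) < k:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(docs, threshold=0.8):
    """Keep a document only if it is not too similar to any document already kept.

    The threshold is an assumed value, not one taken from the report.
    """
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


def fit_unigram(reference_docs):
    """Count tokens in a trusted reference corpus (a stand-in for a real language model)."""
    counts = Counter()
    for doc in reference_docs:
        counts.update(tokenize(doc))
    return counts


def perplexity(text, counts):
    """Perplexity of text under an add-one-smoothed unigram model."""
    tokens = tokenize(text)
    if not tokens:
        return float("inf")
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot for unseen tokens
    log_prob = sum(math.log((counts[t] + 1) / (total + vocab_size)) for t in tokens)
    return math.exp(-log_prob / len(tokens))


def perplexity_filter(docs, counts, max_perplexity):
    """Drop documents whose perplexity exceeds the (assumed) cutoff."""
    return [doc for doc in docs if perplexity(doc, counts) <= max_perplexity]
```

The pairwise comparison in `deduplicate` is quadratic in the number of documents, so a corpus at GitHub scale would need an approximate scheme such as MinHash with locality-sensitive hashing; the sketch only shows the idea.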

