Code large language models mark a pivotal breakthrough in artificial intelligence. They are designed specifically to understand and generate programming languages, and they significantly improve the efficiency of software development workflows. In this technical report, we present CodeShell-Base, a 7-billion-parameter foundation model with an 8K context length that shows exceptional proficiency in code comprehension. By incorporating Grouped-Query Attention and Rotary Positional Embedding into GPT-2, CodeShell-Base combines the structural merits of StarCoder and CodeLlama into its own architectural design. We then built a comprehensive data pre-processing pipeline that includes near-duplicate removal, perplexity-based filtering, and model-based filtering. Through this pipeline, we curated 100 billion tokens of high-quality pre-training data from GitHub. Benefiting from this high-quality data, CodeShell-Base outperforms CodeLlama on HumanEval after training on just 500 billion tokens (5 epochs). We conducted extensive experiments on datasets covering multiple programming languages, including Python, Java, and C++, and the results indicate that our model has robust foundational capabilities in code comprehension and generation.
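The first two stages of the pre-processing pipeline described above (near-duplicate removal and perplexity-based filtering) can be illustrated with a minimal sketch. It is not the report's implementation: the abstract does not specify the similarity measure, the scoring model, or any thresholds. The sketch assumes Jaccard similarity over token shingles for deduplication and an add-one-smoothed unigram model as a stand-in for a real language model. All function names, the tokenizer regex, and the threshold values are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical tokenizer: words and single punctuation marks, whitespace ignored.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    return TOKEN_RE.findall(text)


def shingles(text, k=3):
    """Return the set of k-token shingles of a source file."""
    tokens = tokenize(text)
    if len(tokens) < k:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(docs, threshold=0.8):
    """Greedy near-duplicate removal (assumed threshold).

    A document is kept only if its similarity to every previously
    kept document is below the threshold. This is quadratic in the
    number of documents; large corpora usually need an approximate
    method such as MinHash with locality-sensitive hashing.
    """
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


def fit_unigram(reference_docs):
    """Count tokens in a trusted reference corpus."""
    counts = Counter()
    for doc in reference_docs:
        counts.update(tokenize(doc))
    return counts


def unigram_perplexity(text, counts):
    """Perplexity of text under an add-one-smoothed unigram model."""
    tokens = tokenize(text)
    if not tokens:
        return float("inf")
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen tokens
    log_prob = sum(math.log((counts[t] + 1) / (total + vocab)) for t in tokens)
    return math.exp(-log_prob / len(tokens))


def filter_by_perplexity(docs, counts, max_perplexity):
    """Keep documents whose perplexity does not exceed the cutoff."""
    return [d for d in docs if unigram_perplexity(d, counts) <= max_perplexity]
```

In this sketch, `deduplicate` would run first and `filter_by_perplexity` second, with `max_perplexity` chosen by inspecting the perplexity distribution of a sample. A model-based quality filter, the third stage named in the abstract, would follow the same pattern with a trained classifier in place of the perplexity score.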