The rapid development of large language models has transformed code intelligence in software development. However, the predominance of closed-source models has limited broader research and development. To address this, we introduce the DeepSeek-Coder series, a range of open-source code models with sizes from 1.3B to 33B parameters, trained from scratch on 2 trillion tokens. These models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task with a 16K window to enhance code generation and infilling. Our extensive evaluations demonstrate that DeepSeek-Coder not only achieves state-of-the-art performance among open-source code models across multiple benchmarks but also surpasses existing closed-source models such as Codex and GPT-3.5. Furthermore, DeepSeek-Coder models are released under a permissive license that allows both research and unrestricted commercial use.
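To make the fill-in-the-blank objective concrete, the following is a minimal sketch of how a single training example for such a task might be constructed: a source file is cut at two random points and rearranged so that the missing middle span is predicted last. The sentinel strings `<fim_prefix>`, `<fim_suffix>`, and `<fim_middle>`, the helper names, and the character-level split are illustrative assumptions, not the tokens or procedure actually used for DeepSeek-Coder.

```python
import random

# Hypothetical sentinel strings, chosen for illustration only; the special
# tokens used by any particular model are not specified in this text.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def make_fim_example(code, rng):
    """Cut `code` at two random character offsets and emit the pieces in
    prefix, suffix, middle order, so the middle span comes last."""
    i, j = sorted(rng.randint(0, len(code)) for _ in range(2))
    prefix, middle, suffix = code[:i], code[i:j], code[j:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"


def split_fim_example(example):
    """Recover (prefix, middle, suffix) from a string built by make_fim_example.
    Assumes the original code does not itself contain the sentinel strings."""
    body = example[len(FIM_PREFIX):]
    prefix, rest = body.split(FIM_SUFFIX, 1)
    suffix, middle = rest.split(FIM_MIDDLE, 1)
    return prefix, middle, suffix


if __name__ == "__main__":
    source = "def add(a, b):\n    return a + b\n"
    example = make_fim_example(source, random.Random(0))
    prefix, middle, suffix = split_fim_example(example)
    # Reassembling the three pieces in document order restores the source.
    print(prefix + middle + suffix == source)
```

At inference time, the same layout would let a model receive the prefix and suffix of a file and generate the span between them, which is what infilling requires. A real pipeline would typically operate on token sequences within the context window rather than on raw characters.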