The rapid development of large language models has revolutionized code intelligence in software development. However, the predominance of closed-source models has restricted extensive research and development. To address this, we introduce the DeepSeek-Coder series, a range of open-source code models ranging from 1.3B to 33B parameters, each trained from scratch on 2 trillion tokens. These models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task with a 16K-token window to enhance code generation and infilling. Our extensive evaluations demonstrate that DeepSeek-Coder not only achieves state-of-the-art performance among open-source code models across multiple benchmarks but also surpasses existing closed-source models such as Codex and GPT-3.5. Furthermore, DeepSeek-Coder models are released under a permissive license that allows both research and unrestricted commercial use.
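To make the fill-in-the-blank objective concrete, the following is a minimal sketch of how one infilling training example might be built from a source file. It is an illustration under stated assumptions, not the authors' implementation: the sentinel strings `<FIM_PREFIX>`, `<FIM_SUFFIX>`, and `<FIM_MIDDLE>`, the helper names, the character-level random split, and the prefix-suffix-middle ordering are all hypothetical choices that the abstract does not specify.

```python
import random

# Hypothetical sentinel strings (assumption): a real tokenizer defines its own
# special tokens, which may look quite different from these placeholders.
FIM_PREFIX = "<FIM_PREFIX>"
FIM_SUFFIX = "<FIM_SUFFIX>"
FIM_MIDDLE = "<FIM_MIDDLE>"


def make_fim_example(code: str, rng: random.Random) -> str:
    """Split code into prefix/middle/suffix and move the middle to the end.

    The model then sees the prefix and suffix first and learns to generate
    the missing middle span (prefix-suffix-middle ordering, an assumption).
    """
    # Pick two cut points (possibly equal) and put them in ascending order.
    i, j = sorted(rng.randint(0, len(code)) for _ in range(2))
    prefix, middle, suffix = code[:i], code[i:j], code[j:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"


def recover_original(example: str) -> str:
    """Invert make_fim_example, assuming the sentinels do not occur in the code."""
    rest = example[len(FIM_PREFIX):]
    prefix, rest = rest.split(FIM_SUFFIX, 1)
    suffix, middle = rest.split(FIM_MIDDLE, 1)
    return prefix + middle + suffix


if __name__ == "__main__":
    source = "def add(a, b):\n    return a + b\n"
    example = make_fim_example(source, random.Random(0))
    print(example)
    print(recover_original(example) == source)
```

At inference time the same layout would let a caller supply the code before and after a cursor position and ask the model to produce only the missing span; in practice the split would more likely be made on tokens or lines than on raw characters.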