We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable: throughout the entire training run, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.
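To make the auxiliary-loss-free load-balancing idea concrete, here is a minimal sketch, not DeepSeek-V3's actual implementation: each expert carries a bias that is added to its affinity score only when selecting the top-k experts, and after each batch the bias is nudged down for overloaded experts and up for underloaded ones. The expert count, batch size, and update speed `gamma` below are illustrative values chosen for the sketch, not hyperparameters from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, top_k = 8, 2
gamma = 0.01  # bias update speed; illustrative value, not from the paper
bias = np.zeros(n_experts)

for step in range(100):
    # Stand-in for learned token-to-expert affinity scores
    # (batch of 256 tokens, one score per expert).
    scores = rng.normal(size=(256, n_experts))

    # Route each token to its top-k experts by *biased* score. The bias
    # steers which experts are chosen; in the real model the unbiased
    # scores would still weight the chosen experts' outputs.
    chosen = np.argsort(scores + bias, axis=1)[:, -top_k:]

    # Count per-expert load in this batch, then nudge the biases:
    # overloaded experts get a lower bias, underloaded a higher one.
    load = np.bincount(chosen.ravel(), minlength=n_experts)
    bias -= gamma * np.sign(load - load.mean())
```

Because the bias enters only the routing decision rather than the training loss, a scheme like this can balance expert load without the auxiliary balancing loss used in earlier MoE models, which is the property the abstract refers to as "auxiliary-loss-free".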