We present the Mellum model family: open-weight code completion models designed for interactive use in JetBrains IDEs. Mellum models have 4B parameters, adopt a Llama-style architecture, and are pre-trained on ~4T tokens of permissively licensed, multi-language code. Our studies show that (i) careful data curation and staged training significantly improve model quality, (ii) editor-critical capabilities such as context packing are necessary for high-quality suggestions, and (iii) a compact, task-focused model can meet the cost and latency constraints of interactive completion. In this paper, we describe an end-to-end industrial pipeline for producing contextualized in-editor completion: disciplined data governance, multi-stage training that adds fill-in-the-middle objectives and project context via supervised fine-tuning, and alignment via direct preference optimization using feedback from real-world scenarios. Our quality evaluations include both large-scale offline benchmarks and online telemetry from production deployments in JetBrains IDEs. Mellum models are released under the Apache-2.0 license on HuggingFace, with a public model card that provides a reproducible reference for practitioners. Our experience offers a pragmatic blueprint for taking a focused, open model from research prototype to at-scale production serving hundreds of thousands of users.
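To make the fill-in-the-middle (FIM) training objective concrete, the sketch below shows one common way such prompts are assembled: the code before and after the cursor are wrapped in sentinel tokens, and the model is trained to emit the missing middle span. The sentinel names (`<fim_prefix>`, `<fim_suffix>`, `<fim_middle>`) follow a widely used open-model convention and are illustrative assumptions here, not necessarily the exact tokens used by Mellum.

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a prefix-suffix-middle (PSM) style FIM prompt.

    The model is expected to generate the text that belongs between
    `prefix` and `suffix`, stopping at an end-of-text token.
    Sentinel token names are illustrative, not Mellum-specific.
    """
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"


# Example: the editor cursor sits inside a function body.
prompt = build_fim_prompt(
    prefix="def add(a, b):\n    return ",
    suffix="\n\nprint(add(1, 2))\n",
)
print(prompt)
```

At inference time, the same template is filled from the live editor buffer (plus any packed project context prepended to the prefix), so training and serving see identically shaped inputs.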