We present the Mellum model family: open-weight code completion models designed for interactive use in JetBrains IDEs. Mellum models have 4B parameters, adopt a Llama-style architecture, and are pre-trained on ~4T tokens of permissively licensed, multi-language code. Our studies show that (i) careful data curation and staged training significantly improve model quality, (ii) editor-critical capabilities such as context packing are necessary for high-quality suggestions, and (iii) a compact, task-focused model can meet the cost and latency constraints of interactive completion. In this paper, we describe an end-to-end industrial pipeline for producing contextualized in-editor completion: disciplined data governance, multi-stage training that adds fill-in-the-middle objectives and project context via supervised fine-tuning, and alignment via direct preference optimization using feedback from real-world scenarios. Our quality evaluations include both large-scale offline benchmarks and online telemetry from production deployments in JetBrains IDEs. Mellum models are released under the Apache-2.0 license on HuggingFace, with a public model card that provides a reproducible reference for practitioners. Our experience offers a pragmatic blueprint for taking a focused, open model from research prototype to at-scale production serving hundreds of thousands of users.
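To make the fill-in-the-middle (FIM) training objective concrete, the sketch below shows one common way such prompts are assembled: the code before and after the cursor are wrapped in sentinel tokens, and the model is trained to emit the missing middle span. The sentinel names (`<fim_prefix>`, `<fim_suffix>`, `<fim_middle>`) follow a widely used open-model convention and are illustrative assumptions here, not necessarily the exact tokens used by Mellum.

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a prefix-suffix-middle (PSM) style FIM prompt.

    The model is expected to generate the text that belongs between
    `prefix` and `suffix`, stopping at an end-of-text token.
    Sentinel token names are illustrative, not Mellum-specific.
    """
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"


# Example: the editor cursor sits inside a function body.
prompt = build_fim_prompt(
    prefix="def add(a, b):\n    return ",
    suffix="\n\nprint(add(1, 2))\n",
)
print(prompt)
```

At inference time, the same template is filled from the live editor buffer (plus any packed project context prepended to the prefix), so training and serving see identically shaped inputs.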