We present The Vault, an open-source, large-scale code-text dataset designed to enhance the training of code-focused large language models (LLMs). Existing open-source datasets for training code LLMs are often limited in size, in quality (they contain noisy signals), and in format (they contain only pairs of functions and text explanations). The Vault addresses these limitations by providing 40 million code-text pairs across 10 popular programming languages, thorough cleaning for 10+ prevalent issues, and code-text pairings at several levels, including class, function, and line. Researchers and practitioners can use The Vault to train diverse code-focused LLMs, or apply the provided data-cleaning methods and scripts to improve their own datasets. We anticipate that training code-centric LLMs on The Vault will yield significant advances in code understanding and generation tasks, fostering progress in both artificial intelligence research and software development practice.