We present The Vault, a dataset of high-quality code-text pairs in multiple programming languages for training large language models to understand and generate code. We describe sample-extraction methods that combine rule-based and deep learning-based techniques to ensure that each sample pairs code with high-quality text, yielding a dataset of 43 million code-text pairs. Extensive evaluations on common coding tasks, including code generation, code search, and code summarization, show that code large language models fine-tuned on The Vault outperform the same models trained on other datasets such as CodeSearchNet. We also provide detailed analyses of the dataset to assess the effects of different programming languages and docstrings on model performance.