Motivated by recent work on lifelong learning applications for language models (LMs) of code, we introduce CodeLL, a lifelong learning dataset focused on code changes. Our contribution addresses a notable research gap: existing code change datasets lack a long-term temporal dimension, which limits their suitability for lifelong learning scenarios. In contrast, our dataset aims to comprehensively capture code changes across the entire release history of open-source software repositories. In this work, we present an initial version of CodeLL, comprising 71 machine-learning-based projects mined from Software Heritage. This dataset enables the extraction and in-depth analysis of code changes spanning 2,483 releases at both the method and API levels. CodeLL enables researchers to study the behaviour of LMs in lifelong fine-tuning settings for learning code changes. Additionally, the dataset can help researchers study data distribution shifts within software repositories and the evolution of API usage over time.