Does the training of large language models potentially infringe code licenses? And are there datasets that can safely be used to train such models without violating those licenses? In this study, we assess current trends in the field and the importance of incorporating code into the training of large language models. We also examine publicly available datasets to determine whether models can be trained on them without the risk of future legal disputes. To this end, we compiled a list of 53 large language models trained on file-level code, extracted their training datasets, and analyzed how much they overlap with a dataset we created consisting exclusively of strong-copyleft code. Our analysis revealed license inconsistencies in every dataset we examined, despite each having been filtered by its repositories' licenses. In total, we analyzed 514 million code files and discovered 38 million exact duplicates of files in our strong-copyleft dataset. We further examined 171 million file-leading comments, identifying 16 million carrying strong-copyleft licenses and another 11 million that discouraged copying without explicitly mentioning a license. Based on these findings, which highlight the pervasive issue of license inconsistencies in large language models trained on code, we recommend that both researchers and the community prioritize the development and adoption of best practices for dataset creation and management.
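The analysis of file-leading comments described above can be illustrated with a minimal sketch. This is a hypothetical example, not the study's actual tooling: the regular expressions, the three-way classification, and the 1000-character "leading comment" window are all illustrative assumptions.

```python
import re

# Illustrative patterns for strong-copyleft license mentions (e.g. GPL/AGPL)
# and for comments that discourage copying without naming a license.
STRONG_COPYLEFT = re.compile(
    r"GNU (Affero )?General Public License|GPL-?[23]", re.IGNORECASE
)
DISCOURAGES_COPYING = re.compile(
    r"do not (copy|distribute)|all rights reserved", re.IGNORECASE
)

def classify_leading_comment(source: str, window: int = 1000) -> str:
    """Classify a file by the license signals in its leading characters."""
    head = source[:window]
    if STRONG_COPYLEFT.search(head):
        return "strong-copyleft"
    if DISCOURAGES_COPYING.search(head):
        return "copy-discouraged"
    return "unflagged"

example = (
    "/* This file is part of Foo.\n"
    " * Licensed under the GNU General Public License v3. */\n"
    "int main(void) { return 0; }"
)
print(classify_leading_comment(example))  # → strong-copyleft
```

A production pipeline would instead rely on a dedicated license detector and on exact-hash matching for the duplicate-file analysis, but the sketch shows why repository-level license labels can disagree with what individual files declare.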