Recent advances in Large Language Models (LLMs) have revolutionized code generation, leading to widespread adoption of AI coding tools by developers. However, LLMs can generate license-protected code without providing the necessary license information, leading to potential intellectual property violations during software production. This paper addresses the critical, yet underexplored, issue of license compliance in LLM-generated code by establishing a benchmark that evaluates the ability of LLMs to provide accurate license information for their generated code. To establish this benchmark, we conduct an empirical study to identify a reasonable standard for "striking similarity" that excludes the possibility of independent creation, indicating a copy relationship between the LLM output and existing open-source code. Based on this standard, we propose LiCoEval, a benchmark for evaluating the license compliance capabilities of LLMs. Using LiCoEval, we evaluate 14 popular LLMs and find that even top-performing LLMs produce a non-negligible proportion (0.88% to 2.01%) of code strikingly similar to existing open-source implementations. Notably, most LLMs fail to provide accurate license information, particularly for code under copyleft licenses. These findings underscore the urgent need to enhance the license compliance capabilities of LLMs in code generation tasks. Our study provides a foundation for future research and development to improve license compliance in AI-assisted software development, contributing to both the protection of open-source software copyrights and the mitigation of legal risks for LLM users.