Recent advances in Large Language Models (LLMs) have revolutionized code generation, driving widespread adoption of AI coding tools among developers. However, LLMs can generate license-protected code without providing the necessary license information, creating potential intellectual property violations during software production. This paper addresses the critical yet underexplored issue of license compliance in LLM-generated code by establishing a benchmark that evaluates the ability of LLMs to provide accurate license information for their generated code. To establish this benchmark, we conduct an empirical study to identify a reasonable standard for "striking similarity" that excludes the possibility of independent creation and thus indicates a copy relationship between the LLM output and certain open-source code. Based on this standard, we propose LiCoEval, a benchmark for evaluating the license compliance capabilities of LLMs, i.e., their ability to provide accurate license or copyright information when they generate code strikingly similar to existing copyrighted code. Using LiCoEval, we evaluate 14 popular LLMs and find that even top-performing models produce a non-negligible proportion (0.88% to 2.01%) of code strikingly similar to existing open-source implementations. Notably, most LLMs fail to provide accurate license information, particularly for code under copyleft licenses. These findings underscore the urgent need to enhance LLM compliance capabilities in code generation tasks. Our study provides a foundation for future research and development aimed at improving license compliance in AI-assisted software development, contributing both to the protection of open-source software copyrights and to the mitigation of legal risks for LLM users.