In recent years, Large Language Models (LLMs) have gained significant popularity due to their ability to generate human-like text and their potential applications in fields such as Software Engineering. LLMs for Code are commonly trained on large, unsanitized corpora of source code scraped from the Internet. The models memorize the content of these datasets and emit it, often verbatim. In this work, we discuss the security, privacy, and licensing implications of memorization. We argue that the use of copyleft code to train LLMs poses a legal and ethical dilemma. Finally, we provide four actionable recommendations to address this issue.