In recent years, Large Language Models (LLMs) have gained significant popularity due to their ability to generate human-like text and their potential applications in fields such as Software Engineering. LLMs for Code are commonly trained on large, unsanitized corpora of source code scraped from the Internet. The models memorize content from these datasets and emit it, often verbatim. In this work, we discuss the security, privacy, and licensing implications of memorization. We argue that using copyleft code to train LLMs poses a legal and ethical dilemma. Finally, we provide four actionable recommendations to address this issue.