Do Language Models Plagiarize?

Prior work has shown that language models (LMs) often memorize parts of their training instances and reproduce them during natural language generation (NLG). However, it is unclear to what extent LMs "reuse" a training corpus. For instance, models can generate paraphrased sentences that are contextually similar to training samples. In this work, we therefore study three types of plagiarism (verbatim, paraphrase, and idea) in GPT-2 generated texts by comparing them against GPT-2's training data, and further analyze the plagiarism patterns of LMs fine-tuned on domain-specific corpora, which are extensively used in practice. Our results suggest that (1) all three types of plagiarism are widespread in LMs and go beyond memorization, (2) both the size and the decoding method of an LM are strongly associated with the degree of plagiarism it exhibits, and (3) the plagiarism patterns of fine-tuned LMs vary with corpus similarity and homogeneity. Given that the majority of LMs' training data is scraped from the Web without informing content owners, their reiteration of words, phrases, and even core ideas from training sets in generated texts has ethical implications. These patterns are likely to worsen as both the size of LMs and their training data increase, raising concerns about indiscriminately pursuing larger models with larger training corpora. Plagiarized content can also contain individuals' personal and sensitive information. Overall, these findings cast doubt on the practicality of current LMs in mission-critical writing tasks and call for more discussion of the observed phenomena. Data and source code are available at https://github.com/Brit7777/LM-plagiarism.
