Copyright infringement in frontier LLMs has received much attention recently due to the New York Times v. OpenAI lawsuit, filed in December 2023. The New York Times claims that GPT-4 infringed its copyrights by reproducing its articles for use in LLM training and by memorizing them, thereby publicly displaying them in LLM outputs. Our work measures the propensity of OpenAI's LLMs to exhibit verbatim memorization in their outputs relative to other LLMs, focusing specifically on news articles. We discover that both GPT and Claude models use refusal training and output filters to prevent verbatim output of memorized articles. We apply a basic prompt template to bypass the refusal training and show that OpenAI models are currently less prone to memorization elicitation than models from Meta, Mistral, and Anthropic. We find that as models increase in size, particularly beyond 100 billion parameters, they demonstrate significantly greater capacity for memorization. Our findings have practical implications for training: more attention must be placed on preventing verbatim memorization in very large models. Our findings also have legal significance: in assessing the relative memorization capacity of OpenAI's LLMs, we probe the strength of The New York Times's copyright infringement claims and OpenAI's legal defenses, while underscoring issues at the intersection of generative AI, law, and policy.