Copyright infringement in frontier LLMs has received much attention recently due to the New York Times v. OpenAI lawsuit, filed in December 2023. The New York Times claims that GPT-4 infringed its copyrights by reproducing its articles for use in LLM training and by memorizing them, thereby publicly displaying them in LLM outputs. Our work measures the propensity of OpenAI's LLMs to exhibit verbatim memorization in their outputs relative to other LLMs, focusing specifically on news articles. We discover that both GPT and Claude models use refusal training and output filters to prevent verbatim output of memorized articles. We apply a basic prompt template to bypass the refusal training and show that OpenAI models are currently less prone to memorization elicitation than models from Meta, Mistral, and Anthropic. We find that as models increase in size, particularly beyond 100 billion parameters, they demonstrate significantly greater capacity for memorization. Our findings have practical implications for training: more attention must be placed on preventing verbatim memorization in very large models. Our findings also have legal significance: in assessing the relative memorization capacity of OpenAI's LLMs, we probe the strength of The New York Times's copyright infringement claims and OpenAI's legal defenses, while underscoring issues at the intersection of generative AI, law, and policy.