Language models (LMs) derive their capabilities from extensive training on diverse data, including potentially copyrighted material. These models can memorize and generate content similar to their training data, raising potential concerns. Model creators are therefore motivated to develop mitigation methods that prevent the generation of protected content. We term this procedure copyright takedowns for LMs, noting its conceptual similarity to (but legal distinction from) the DMCA takedown. This paper introduces the first evaluation of the feasibility and side effects of copyright takedowns for LMs. We propose CoTaEval, an evaluation framework that assesses the effectiveness of copyright takedown methods, their impact on the model's ability to retain uncopyrightable factual knowledge from training data whose recitation is embargoed, and how well the model maintains its general utility and efficiency. We examine several strategies, including adding system prompts, decoding-time filtering interventions, and unlearning approaches. Our findings indicate that no tested method excels across all metrics, revealing significant room for research in this unique problem setting and pointing to potentially unresolved challenges for live policy proposals.