LLMs and Memorization: On Quality and Specificity of Copyright Compliance

Memorization in large language models (LLMs) is a growing concern. LLMs have been shown to readily reproduce parts of their training data, including copyrighted work. This is an important problem to solve, as such reproduction may violate existing copyright laws as well as the European AI Act. In this work, we propose a systematic analysis to quantify the extent of potential copyright infringements in LLMs, using European law as an example. Unlike previous work, we evaluate instruction-finetuned models in a realistic end-user scenario. Our analysis builds on two components: a proposed threshold of 160 characters, which we borrow from the German Copyright Service Provider Act, and a fuzzy text matching algorithm that identifies potentially copyright-infringing textual reproductions. We analyze the specificity of countermeasures against copyright infringement by comparing model behavior on copyrighted and public domain data. We investigate what behaviors models show instead of producing protected text (such as refusal or hallucination) and provide a first legal assessment of these behaviors. We find large differences in copyright compliance, specificity, and appropriate refusal among popular LLMs. Alpaca, GPT 4, GPT 3.5, and Luminous perform best in our comparison, with OpenGPT-X, Alpaca, and Luminous producing a particularly low absolute number of potential copyright violations. Code will be published soon.
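The abstract names a 160-character threshold and a fuzzy text matching algorithm but does not say which algorithm is used. As a rough illustration only, the sketch below uses Python's standard `difflib.SequenceMatcher` as a stand-in: it merges exact matching blocks that are separated by small gaps and flags an output when the longest merged run reaches 160 characters. The function names and the `MAX_GAP` tolerance are hypothetical and are not taken from the paper.

```python
from difflib import SequenceMatcher

THRESHOLD_CHARS = 160  # threshold named in the abstract
MAX_GAP = 3            # hypothetical tolerance for small edits; not from the paper


def longest_fuzzy_overlap(reference: str, output: str, max_gap: int = MAX_GAP) -> int:
    """Return the length, in output characters, of the longest run of
    matching blocks whose gaps in both texts are at most max_gap."""
    # autojunk=False keeps frequent characters (e.g. spaces) from being ignored.
    matcher = SequenceMatcher(None, reference, output, autojunk=False)
    best = 0
    run_start = None
    prev_ref_end = prev_out_end = 0
    for ref_pos, out_pos, size in matcher.get_matching_blocks():
        if size == 0:
            continue  # skip the terminating sentinel block
        close_enough = (
            run_start is not None
            and ref_pos - prev_ref_end <= max_gap
            and out_pos - prev_out_end <= max_gap
        )
        if not close_enough:
            run_start = out_pos  # start a new run at this block
        prev_ref_end, prev_out_end = ref_pos + size, out_pos + size
        best = max(best, prev_out_end - run_start)
    return best


def is_potential_reproduction(reference: str, output: str) -> bool:
    """Flag an output whose fuzzy overlap with the reference reaches the threshold."""
    return longest_fuzzy_overlap(reference, output) >= THRESHOLD_CHARS
```

With `max_gap=0` the function reduces to the longest exact matching block, so a single altered character can split a long reproduction into two shorter ones. A small positive tolerance keeps such near-verbatim copies above the threshold, which is the intuition behind using fuzzy rather than exact matching.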

