Memorization in large language models (LLMs) is a growing concern. LLMs have been shown to easily reproduce parts of their training data, including copyrighted work. This is an important problem to solve, as such reproduction may violate existing copyright laws as well as the European AI Act. In this work, we propose a systematic analysis to quantify the extent of potential copyright infringements in LLMs, using European law as an example. Unlike previous work, we evaluate instruction-finetuned models in a realistic end-user scenario. Our analysis builds on a proposed threshold of 160 characters, which we borrow from the German Copyright Service Provider Act, and on a fuzzy text matching algorithm to identify potentially copyright-infringing textual reproductions. We analyze the specificity of countermeasures against copyright infringement by comparing model behavior on copyrighted and public domain data. We investigate what behaviors models show instead of producing protected text (such as refusal or hallucination) and provide a first legal assessment of these behaviors. We find large differences in copyright compliance, specificity, and appropriate refusal among popular LLMs. Alpaca, GPT 4, GPT 3.5, and Luminous perform best in our comparison, with OpenGPT-X, Alpaca, and Luminous producing a particularly low absolute number of potential copyright violations. Code will be published soon.
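To make the detection criterion concrete, the following is a minimal sketch of how a 160-character threshold could be combined with fuzzy matching. It is not the authors' implementation, which is not described here: the use of Python's `difflib.SequenceMatcher`, the function names, and the `max_gap` tolerance are all assumptions chosen for illustration. The sketch chains matching blocks between a model output and a reference text when the gaps between them are small, and flags the output if the longest such chain reaches the threshold.

```python
from difflib import SequenceMatcher

# Threshold named in the abstract (borrowed there from the German
# Copyright Service Provider Act).
THRESHOLD_CHARS = 160


def longest_fuzzy_overlap(output: str, reference: str, max_gap: int = 3) -> int:
    """Length, in characters of `output`, of the longest chain of matching
    blocks whose gaps in both strings are at most `max_gap` characters.

    `max_gap` is a hypothetical tolerance, not a value from the paper.
    """
    # autojunk=False disables difflib's heuristic that ignores frequent
    # characters in long sequences, which would distort prose matching.
    matcher = SequenceMatcher(None, output, reference, autojunk=False)
    best = 0
    run_start = None
    prev_end_out = prev_end_ref = 0
    for out_pos, ref_pos, size in matcher.get_matching_blocks():
        if size == 0:
            continue  # terminating sentinel block
        close_enough = (
            run_start is not None
            and out_pos - prev_end_out <= max_gap
            and ref_pos - prev_end_ref <= max_gap
        )
        if not close_enough:
            run_start = out_pos  # start a new chain at this block
        prev_end_out, prev_end_ref = out_pos + size, ref_pos + size
        best = max(best, prev_end_out - run_start)
    return best


def is_potential_reproduction(
    output: str,
    reference: str,
    threshold: int = THRESHOLD_CHARS,
    max_gap: int = 3,
) -> bool:
    """Flag outputs whose fuzzy overlap with the reference reaches the threshold."""
    return longest_fuzzy_overlap(output, reference, max_gap) >= threshold
```

With this design, an output that copies a long passage but alters a character every hundred characters or so is still flagged, whereas an exact longest-common-substring check would miss it. The gap tolerance trades recall against false positives and would need tuning on real data.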