Training generative AI models requires large amounts of data. A common practice is to collect such data through web scraping. Yet much of what has been, and continues to be, collected is protected by copyright, so its use may constitute copyright infringement. In the USA, AI developers rely on "fair use", while in Europe the prevailing view is that the exception for "Text and Data Mining" (TDM) applies. In a recent interdisciplinary tandem study, we argued in detail that this is not the case, because generative AI training differs fundamentally from TDM. In this article, we share our main findings and their implications for both public and corporate research on generative models. We further discuss how the phenomenon of training data memorization leads to copyright issues independently of the "fair use" and TDM exceptions. Finally, we outline how ISMIR could contribute to the ongoing discussion about fair practices for generative AI that satisfy all stakeholders.