There is a growing concern that generative AI models will generate outputs closely resembling the copyrighted materials on which they are trained. This worry has intensified as the quality and complexity of generative models have improved immensely and the availability of extensive datasets containing copyrighted material has expanded. Researchers are actively exploring strategies to mitigate the risk of generating infringing samples, and a recent line of work suggests employing techniques such as differential privacy and other forms of algorithmic stability to provide guarantees against infringing copying. In this work, we examine whether such algorithmic stability techniques are suitable for ensuring the responsible use of generative models without inadvertently violating copyright laws. We argue that while these techniques aim to verify the presence of identifiable information in datasets, and are thus privacy-oriented, copyright law aims to promote the use of original works for the benefit of society as a whole, provided that no unlicensed use of protected expression occurs. These fundamental differences between privacy and copyright must not be overlooked. In particular, we demonstrate that while algorithmic stability may be perceived as a practical tool to detect copying, such copying does not necessarily constitute copyright infringement. Therefore, if adopted as a standard for detecting and establishing copyright infringement, algorithmic stability may undermine the intended objectives of copyright law.