The widespread use of foundation models has introduced a new source of copyright risk. This risk is the subject of an active, ongoing debate among data scientists and legal scholars, in which claims and results from both sides are often interpreted differently and taken to have different implications. Our position is that much of the technical literature relies on traditional reconstruction techniques that were not designed for copyright analysis. As a result, memorization and copying have been conflated in both the technical and legal communities and across multiple contexts. We argue that memorization, as commonly studied in data science, should not be equated with copying and should not be used as a proxy for copyright infringement. We distinguish technical signals that meaningfully indicate infringement risk from those that instead reflect lawful generalization or high-frequency content. Based on this analysis, we advocate for an output-level, risk-based evaluation process that aligns technical assessments with established copyright standards and provides a more principled foundation for research, auditing, and policy.