Not All Similarities Are Created Equal: Leveraging Data-Driven Biases to Inform GenAI Copyright Disputes

The advent of Generative Artificial Intelligence (GenAI) models, including GitHub Copilot, OpenAI GPT, and Stable Diffusion, has revolutionized content creation, enabling non-professionals to produce high-quality content across various domains. This transformative technology has led to a surge of synthetic content and sparked legal disputes over copyright infringement. To address these challenges, this paper introduces a novel approach that leverages the learning capacity of GenAI models for copyright legal analysis, demonstrated with GPT-2 and Stable Diffusion models. Copyright law distinguishes between original expressions and generic ones (scènes à faire), protecting the former and permitting reproduction of the latter. However, this distinction has historically been difficult to draw consistently, leading to over-protection of copyrighted works. GenAI offers an unprecedented opportunity to enhance this legal analysis by revealing shared patterns in preexisting works. We propose a data-driven approach that uses "data-driven bias" to assess the genericity of expressive compositions in works created by GenAI. This approach aids in determining copyright scope by using the capabilities of GenAI to identify and prioritize expressive elements and rank them according to their frequency in the model's dataset. Measuring expressive genericity could have significant implications for copyright law. Such scoring could assist courts in determining copyright scope during litigation; inform the registration practices of Copyright Offices, allowing registration of only highly original synthetic works; and help copyright owners signal the value of their works and facilitate fairer licensing deals. More generally, this approach offers valuable insights to policymakers grappling with adapting copyright law to the challenges posed by the era of GenAI.
