Foundation Models and Fair Use

Existing foundation models are trained on copyrighted material. Deploying these models can pose both legal and ethical risks when data creators fail to receive appropriate attribution or compensation. In the United States and several other countries, copyrighted content may be used to build foundation models without incurring liability due to the fair use doctrine. However, there is a caveat: If the model produces output that is similar to copyrighted data, particularly in scenarios that affect the market of that data, fair use may no longer apply to the output of the model. In this work, we emphasize that fair use is not guaranteed, and additional work may be necessary to keep model development and deployment squarely in the realm of fair use. First, we survey the potential risks of developing and deploying foundation models based on copyrighted content. We review relevant U.S. case law, drawing parallels to existing and potential applications for generating text, source code, and visual art. Experiments confirm that popular foundation models can generate content considerably similar to copyrighted material. Second, we discuss technical mitigations that can help foundation models stay in line with fair use. We argue that more research is needed to align mitigation strategies with the current state of the law. Lastly, we suggest that the law and technical mitigations should co-evolve. For example, coupled with other policy mechanisms, the law could more explicitly consider safe harbors when strong technical tools are used to mitigate infringement harms. This co-evolution may help strike a balance between intellectual property and innovation, which speaks to the original goal of fair use. But we emphasize that the strategies we describe here are not a panacea and more work is needed to develop policies that address the potential harms of foundation models.

翻译：现有基础模型均基于受版权保护的材料进行训练。当数据创作者未能获得适当的归属或补偿时，部署这些模型可能带来法律和伦理风险。在美国及其他一些国家，依据合理使用原则，使用受版权保护的内容构建基础模型可能无需承担法律责任。但需注意：若模型生成与版权数据相似的内容，特别是当这些内容影响该数据市场时，合理使用原则可能不再适用于模型输出。本文强调，合理使用并非必然，仍需额外工作确保模型开发与部署完全处于合理使用范畴。首先，我们梳理了基于受版权保护内容开发与部署基础模型的潜在风险，通过回顾美国相关判例法，与现有及潜在的文本、源代码及视觉艺术生成应用建立类比分析。实验证实，主流基础模型能够生成与受版权材料高度相似的内容。其次，我们探讨了有助于基础模型符合合理使用原则的技术缓解措施，并指出需进一步研究使缓解策略与现行法律框架保持同步。最后，我们建议法律与技术缓解措施应协同演进。例如，当强效技术工具被用于降低侵权损害时，法律可结合其他政策机制更明确地考虑安全港条款。这种协同演进有助于平衡知识产权与创新需求，这与合理使用的初衷相契合。但需强调，本文所述策略并非万能灵药，仍需更多工作制定应对基础模型潜在危害的相关政策。