Many generative foundation models (GFMs) are trained on publicly available data and use public infrastructure, but (1) may degrade the "digital commons" that they depend on, and (2) lack processes for returning the value they capture to data producers and stakeholders. Existing conceptions of data rights and protection (which focus largely on individually owned data and associated privacy concerns) and copyright- or licensing-based models offer some instructive priors, but are ill-suited to the issues that may arise from models trained on commons-based data. We outline the risks posed by GFMs and why they are relevant to the digital commons, and propose several governance-based solutions: investments in standardized dataset and model disclosure and other forms of transparency about generative models' training and capabilities; consortia-based funding for monitoring, standards, and auditing organizations; requirements or norms for GFM companies to contribute high-quality data to the commons; and structures for shared ownership based on individual or community provision of fine-tuning data.