Reclaiming the Digital Commons: A Public Data Trust for Training Data

Democratization of AI means not only that people can freely use AI, but also that people can collectively decide how AI is to be used. In particular, collective decision-making power is required to redress the negative externalities from the development of increasingly advanced AI systems, including degradation of the digital commons and unemployment from automation. The rapid pace of AI development and deployment currently leaves little room for this power. Development of the most capable foundation models, monopolized by private corporations, has proceeded largely without public input. No mechanism is currently in place to ensure that the economic value generated by such models is redistributed to account for their negative externalities. The citizens who generated the data necessary to train models have no say in how their data are used. In this work, we propose that a public data trust assert control over training data for foundation models. In particular, this trust should scrape the internet as a digital commons and license the resulting data to commercial model developers in exchange for a percentage of revenues from deployment. First, we argue in detail for the existence of such a trust and discuss its feasibility and potential risks. Second, we detail a number of ways for a data trust to incentivize model developers to use training data only from the trust: a mix of verification mechanisms, potential regulatory action, and positive incentives. We conclude by highlighting other potential benefits of our proposed data trust and connecting our work to ongoing efforts in data and compute governance.

