Reclaiming the Digital Commons: A Public Data Trust for Training Data

Democratization of AI means not only that people can freely use AI, but also that people can collectively decide how AI is to be used. In particular, collective decision-making power is required to redress the negative externalities from the development of increasingly advanced AI systems, including degradation of the digital commons and unemployment from automation. The rapid pace of AI development and deployment currently leaves little room for this power. Monopolized in the hands of private corporations, the development of the most capable foundation models has proceeded largely without public input. No mechanism is currently in place to ensure that the economic value generated by such models is redistributed to account for their negative externalities. The citizens who generated the data necessary to train these models have no say in how their data are used. In this work, we propose that a public data trust assert control over training data for foundation models. In particular, this trust should scrape the internet as a digital commons and license the resulting data to commercial model developers in exchange for a percentage of revenues from deployment. First, we argue in detail for the existence of such a trust, and we discuss its feasibility and potential risks. Second, we detail a number of ways for a data trust to incentivize model developers to use training data only from the trust. We propose a mix of verification mechanisms, potential regulatory action, and positive incentives. We conclude by highlighting other potential benefits of our proposed data trust and connecting our work to ongoing efforts in data and compute governance.

