The Economics of AI Training Data: A Research Agenda

Despite data's central role in AI production, it remains the least understood input. As AI labs exhaust public data and turn to proprietary sources, with deals reaching hundreds of millions of dollars, research across computer science, economics, law, and policy has fragmented. We establish data economics as a coherent field through three contributions. First, we characterize data's distinctive properties -- nonrivalry, context dependence, and emergent rivalry through contamination -- and trace historical precedents for market formation in commodities such as oil and grain. Second, we present systematic documentation of AI training data deals from 2020 to 2025, revealing persistent market fragmentation, five distinct pricing mechanisms (from per-unit licensing to commissioning), and the exclusion of original creators from compensation in most deals. Third, we propose a formal hierarchy of exchangeable data units (token, record, dataset, corpus, stream) and argue for data's explicit representation in production functions. Building on these foundations, we outline four open research problems for data economics: measuring context-dependent value, balancing governance with privacy, estimating data's contribution to production, and designing mechanisms for heterogeneous, compositional goods.

