Data is central to AI production, yet it remains the least understood input. As AI labs exhaust public data and turn to proprietary sources, with deals reaching hundreds of millions of dollars, research across computer science, economics, law, and policy has fragmented. We establish data economics as a coherent field through three contributions. First, we characterize data's distinctive properties (nonrivalry, context dependence, and emergent rivalry through contamination) and trace historical precedents for market formation in commodities such as oil and grain. Second, we present systematic documentation of AI training data deals from 2020 to 2025, revealing persistent market fragmentation, five distinct pricing mechanisms (from per-unit licensing to commissioning), and the exclusion of original creators from compensation in most deals. Third, we propose a formal hierarchy of exchangeable data units (token, record, dataset, corpus, stream) and argue for data's explicit representation in production functions. Building on these contributions, we outline four open research problems for data economics: measuring context-dependent value, balancing governance with privacy, estimating data's contribution to production, and designing mechanisms for heterogeneous, compositional goods.