Training data is the backbone of large language models (LLMs), yet today's data markets often operate under exploitative pricing -- sourcing data from marginalized groups with little pay or recognition. This paper introduces a theoretical framework for LLM data markets, modeling the strategic interactions between buyers (LLM builders) and sellers (human annotators). We begin with theoretical and empirical analysis showing how exploitative pricing drives high-quality sellers out of the market, degrading data quality and long-term model performance. We then introduce fairshare, a pricing mechanism grounded in data valuation that quantifies each data point's contribution. It aligns incentives by sustaining seller participation and optimizing utility for both buyers and sellers. Theoretically, we show that fairshare yields mutually optimal outcomes: it maximizes long-term buyer utility and seller profit while sustaining market participation. Empirically, when training open-source LLMs on complex NLP tasks, including math problems, medical diagnosis, and physical reasoning, fairshare boosts seller earnings and ensures a stable supply of high-quality data, while improving buyers' performance-per-dollar and long-term welfare. Our findings offer a concrete path toward fair, transparent, and economically sustainable data markets for LLMs.