Data valuation methods assign a marginal utility to each data point that contributes to training a machine learning model. Used directly as a payout mechanism, they create a hidden cost of valuation: contributors with near-zero marginal value receive nothing, even though their data had to be collected and assessed. To formalize this cost, we introduce a conceptual, game-theoretic model, the Information Disclosure Game, between a Data Union (DU), a member-run agent representing contributors that is sometimes also called a data trust, and a Data Consumer (e.g., a platform). After aggregating its members' data, the DU releases information progressively by adding Laplace noise under a differentially private mechanism. Through simulations with strategies guided by data Shapley values and multi-armed bandit exploration, we demonstrate on a Yelp review helpfulness prediction task that data valuation inherently incurs an explicit acquisition cost and that the DU's collective disclosure policy changes how this cost is distributed across members.
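As a rough illustration of the progressive release step, the following sketch publishes a noisy aggregate once per round using the Laplace mechanism. It is a minimal example under stated assumptions, not the mechanism used in this work: the released statistic (a clipped mean), the clipping bounds, the per-round privacy budgets, and the names `laplace_noise` and `progressive_release` are all hypothetical choices made for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale).

    The difference of two independent Exp(1) variables is Laplace(0, 1).
    """
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def progressive_release(values, lower, upper, epsilons, seed=0):
    """Release a noisy clipped mean of `values` once per round.

    Assumption: round r spends its own budget epsilons[r]; under sequential
    composition the total privacy budget spent is sum(epsilons).
    """
    rng = random.Random(seed)
    n = len(values)
    # Clip each record so that the sensitivity of the mean is bounded.
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / n
    # Replacing one record changes the clipped mean by at most this amount.
    sensitivity = (upper - lower) / n
    releases = []
    for eps in epsilons:
        scale = sensitivity / eps  # larger epsilon means less noise
        releases.append(true_mean + laplace_noise(scale, rng))
    return releases


if __name__ == "__main__":
    # Hypothetical helpfulness scores in [0, 5], disclosed over three rounds
    # with an increasing per-round budget (i.e., progressively less noise).
    scores = [1.0, 2.0, 3.0, 4.0]
    for eps, value in zip([0.1, 0.5, 2.0],
                          progressive_release(scores, 0.0, 5.0, [0.1, 0.5, 2.0])):
        print(f"epsilon={eps}: released mean = {value:.3f}")
```

An increasing budget schedule is only one way to model progressive disclosure: each later round reveals a more accurate view of the aggregate, at the price of a larger cumulative privacy loss.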