Data Shapley provides a principled approach to data valuation and plays a crucial role in data-centric machine learning (ML) research. Data selection is considered a standard application of Data Shapley. However, its data selection performance has been shown to be inconsistent across settings in the literature. This study aims to deepen our understanding of this phenomenon. We introduce a hypothesis testing framework and show that, without specific constraints on utility functions, Data Shapley's performance can be no better than random selection. We identify a class of utility functions, monotonically transformed modular functions, within which Data Shapley optimally selects data. Based on this insight, we propose a heuristic for predicting Data Shapley's effectiveness in data selection tasks. Our experiments corroborate these findings, adding new insights into when Data Shapley may or may not succeed.
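To make the modular case concrete, here is a minimal sketch (with hypothetical per-point values, not data from the paper) showing why Data Shapley selects optimally when the utility is modular, i.e. U(S) = Σ_{i∈S} v_i: the exact Shapley value of each point then equals its own contribution v_i, so ranking by Shapley value recovers the optimal selection order. A monotone transformation of U changes the Shapley values but not this ranking.

```python
import itertools

# Hypothetical per-point contributions (illustrative, not from the paper).
v = [0.3, -0.1, 0.5, 0.0, 0.2]
n = len(v)

def utility(subset):
    """Modular utility: the sum of per-point values in the subset."""
    return sum(v[i] for i in subset)

def exact_shapley(n, utility):
    """Exact Shapley values: average marginal gain over all orderings."""
    shap = [0.0] * n
    perms = list(itertools.permutations(range(n)))
    for perm in perms:
        coalition = set()
        for i in perm:
            before = utility(coalition)
            coalition.add(i)
            shap[i] += utility(coalition) - before
    return [s / len(perms) for s in shap]

shap = exact_shapley(n, utility)
# For a modular utility, Shapley values coincide with the per-point values,
# so selecting the top-k points by Shapley value is optimal.
print(all(abs(s - vi) < 1e-12 for s, vi in zip(shap, v)))  # True
```

The exact computation enumerates all n! orderings, so it is only feasible for toy sizes; it serves here to verify the identity, not as a practical estimator.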