Data valuation aims to quantify the usefulness of individual data sources in training machine learning (ML) models, and is a critical aspect of data-centric ML research. Despite its importance, data valuation faces significant yet frequently overlooked privacy challenges. This paper studies these challenges with a focus on KNN-Shapley, one of the most practical data valuation methods currently available. We first highlight the inherent privacy risks of KNN-Shapley and demonstrate the significant technical difficulties in adapting it to satisfy differential privacy (DP). To overcome these challenges, we introduce TKNN-Shapley, a refined, privacy-friendly variant of KNN-Shapley that can be straightforwardly modified to provide DP guarantees (DP-TKNN-Shapley). We show that DP-TKNN-Shapley has several advantages and offers a better privacy-utility tradeoff than naively privatized KNN-Shapley in discerning data quality. Moreover, even non-private TKNN-Shapley achieves performance comparable to KNN-Shapley. Overall, our findings suggest that TKNN-Shapley is a promising alternative to KNN-Shapley, particularly for real-world applications involving sensitive data.