Data valuation, a critical aspect of data-centric machine learning (ML) research, aims to quantify the usefulness of individual data sources in training ML models. Despite its importance, data valuation faces significant yet frequently overlooked privacy challenges. This paper studies these challenges with a focus on KNN-Shapley, currently one of the most practical data valuation methods. We first highlight the inherent privacy risks of KNN-Shapley and demonstrate the significant technical difficulties in adapting it to accommodate differential privacy (DP). To overcome these challenges, we introduce TKNN-Shapley, a refined, privacy-friendly variant of KNN-Shapley that can be modified straightforwardly to incorporate a DP guarantee (DP-TKNN-Shapley). We show that DP-TKNN-Shapley has several advantages and offers a superior privacy-utility tradeoff compared to naively privatized KNN-Shapley in discerning data quality. Moreover, even non-private TKNN-Shapley achieves performance comparable to KNN-Shapley. Overall, our findings suggest that TKNN-Shapley is a promising alternative to KNN-Shapley, particularly for real-world applications involving sensitive data.