Machine learning (ML) algorithms rely primarily on the availability of training data. Depending on the domain, these data may include sensitive information about the data providers, which raises significant privacy issues. Differential privacy (DP) is the predominant solution for privacy-preserving ML, and the local model of DP is the preferred choice when the server or the data collector is not trusted. Recent experimental studies have shown that local DP can affect ML predictions for different subgroups of individuals, and thus the fairness of decision-making. However, the results conflict: some studies show a positive impact of privacy on fairness, while others show a negative one. In this work, we conduct a systematic and formal study of the effect of local DP on fairness. Specifically, we perform a quantitative study of how the fairness of the decisions made by the ML model changes under local DP for different levels of privacy and different data distributions. In particular, we provide bounds, in terms of the joint distributions and the privacy level, that delimit the extent to which local DP can impact the fairness of the model. We characterize the cases in which privacy reduces discrimination and those in which it has the opposite effect. We validate our theoretical findings on synthetic and real-world datasets. Our results are preliminary in the sense that, for now, we study only the case of a single sensitive attribute, and only three fairness metrics: statistical disparity, conditional statistical disparity, and equal opportunity difference.
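To make the quantities in the abstract concrete, the following is a minimal sketch, not the authors' method. It assumes binary randomized response as the local DP mechanism, applied to a single binary sensitive attribute, and it computes statistical disparity as the absolute gap in positive-decision rates between the two groups. The function names, the synthetic decision rates (0.7 and 0.3), and the `demo` helper are all hypothetical choices made for illustration. The sketch only shows how the measured gap between reported groups changes with the privacy level `epsilon`; it does not reproduce the paper's bounds or its model-training setup.

```python
import math
import random


def randomized_response(bit, epsilon, rng):
    """Binary randomized response (assumed mechanism, not taken from the paper).

    Reports the true bit with probability e^eps / (e^eps + 1), otherwise
    reports the flipped bit. This satisfies epsilon-local DP for one bit.
    """
    p_keep = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return bit if rng.random() < p_keep else 1 - bit


def statistical_disparity(groups, decisions):
    """Return |P(decision=1 | group=1) - P(decision=1 | group=0)|.

    Assumes both groups are non-empty.
    """
    rates = {}
    for g in (0, 1):
        members = [d for a, d in zip(groups, decisions) if a == g]
        rates[g] = sum(members) / len(members)
    return abs(rates[1] - rates[0])


def demo(epsilon, n=20000, seed=0):
    """Hypothetical synthetic experiment.

    Group 1 receives a positive decision with probability 0.7 and group 0
    with probability 0.3. Returns the disparity measured against the true
    attribute and against the randomized (reported) attribute.
    """
    rng = random.Random(seed)
    groups = [rng.randint(0, 1) for _ in range(n)]
    decisions = [1 if rng.random() < (0.7 if a == 1 else 0.3) else 0
                 for a in groups]
    reported = [randomized_response(a, epsilon, rng) for a in groups]
    return (statistical_disparity(groups, decisions),
            statistical_disparity(reported, decisions))


if __name__ == "__main__":
    for eps in (0.1, 1.0, 5.0):
        true_gap, noisy_gap = demo(eps)
        print(f"epsilon={eps}: true gap={true_gap:.3f}, reported gap={noisy_gap:.3f}")
```

In this toy setting, a smaller `epsilon` (stronger privacy) flips more attribute values, so the gap measured against the reported attribute moves toward zero while the gap against the true attribute is unchanged. Whether a model trained on such data becomes more or less fair is the question the paper studies, and this sketch does not answer it.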