To train and deploy predictive models, organizations collect large amounts of detailed client data, risking the exposure of private information in the event of a breach. To mitigate this risk, policymakers increasingly demand compliance with the data minimization (DM) principle, which restricts data collection to what is relevant and necessary for the task. Despite this regulatory pressure, the problem of deploying machine learning models that obey DM has so far received little attention. In this work, we address this challenge in a comprehensive manner. We propose a novel vertical DM (vDM) workflow based on data generalization, which by design ensures that no full-resolution client data is collected during the training and deployment of models, benefiting client privacy by reducing the attack surface in case of a breach. We formalize and study the corresponding problem of finding generalizations that both maximize data utility and minimize empirical privacy risk, which we quantify by introducing a diverse set of policy-aligned adversarial scenarios. Finally, we propose a range of baseline vDM algorithms, as well as Privacy-aware Tree (PAT), an especially effective vDM algorithm that outperforms all baselines across several settings. We plan to release our code as a publicly available library, helping advance the standardization of DM for machine learning. Overall, we believe our work can help lay the foundation for further exploration and adoption of DM principles in real-world applications.
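To make the idea of data generalization concrete, the following is a minimal sketch of how a client record could be coarsened before collection, so that only the generalized values are stored. It is an illustration under stated assumptions, not the vDM workflow or the PAT algorithm from the paper: the bucket boundaries in `AGE_EDGES`, the field names, and the helper functions `generalize_age` and `generalize_record` are all hypothetical.

```python
from bisect import bisect_right

# Hypothetical bucket boundaries chosen for illustration only;
# they are not taken from the paper.
AGE_EDGES = [18, 30, 45, 65]


def generalize_age(age: int) -> str:
    """Map an exact age to a coarse interval label."""
    i = bisect_right(AGE_EDGES, age)
    if i == 0:
        return f"<{AGE_EDGES[0]}"
    if i == len(AGE_EDGES):
        return f">={AGE_EDGES[-1]}"
    # Interval is closed on the left and open on the right.
    return f"{AGE_EDGES[i - 1]}-{AGE_EDGES[i] - 1}"


def generalize_record(record: dict) -> dict:
    """Return a coarsened copy of a record (hypothetical fields).

    The full-resolution values are never stored: the age becomes an
    interval and the ZIP code keeps only its first two characters.
    """
    return {
        "age": generalize_age(record["age"]),
        "zip": record["zip"][:2] + "***",
    }


if __name__ == "__main__":
    client = {"age": 37, "zip": "80331"}
    print(generalize_record(client))
```

In this sketch the choice of buckets is fixed by hand. The problem the abstract describes is how to choose such generalizations so that data utility stays high while empirical privacy risk is minimized.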