To train and deploy predictive models, organizations collect large amounts of detailed client data, risking the exposure of private information in the event of a breach. To mitigate this risk, policymakers increasingly demand compliance with the data minimization (DM) principle, which restricts data collection to what is relevant and necessary for the task. Despite regulatory pressure, the problem of deploying machine learning models that obey DM has so far received little attention. In this work, we address this challenge comprehensively. We propose a novel vertical DM (vDM) workflow based on data generalization, which by design ensures that no full-resolution client data is collected during training and deployment of models, benefiting client privacy by reducing the attack surface in case of a breach. We formalize and study the corresponding problem of finding generalizations that both maximize data utility and minimize empirical privacy risk, which we quantify by introducing a diverse set of policy-aligned adversarial scenarios. Finally, we propose a range of baseline vDM algorithms, as well as Privacy-aware Tree (PAT), an especially effective vDM algorithm that outperforms all baselines across several settings. We plan to release our code as a publicly available library, helping advance the standardization of DM for machine learning. Overall, we believe our work can help lay the foundation for further exploration and adoption of DM principles in real-world applications.
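
To make the idea of data generalization concrete, the following minimal sketch coarsens a client record before it would leave the client, so that only the reduced-resolution version is ever collected. It is an illustration under stated assumptions, not the vDM workflow or the PAT algorithm from this work: the `Generalization` class, the `age` and `zip` attributes, and the chosen bin width and prefix length are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Generalization:
    """Hypothetical generalization: coarsen each attribute of a client record."""

    age_bin_width: int   # assumed parameter: width of the age intervals
    zip_prefix_len: int  # assumed parameter: number of leading zip digits kept

    def apply(self, record: dict) -> dict:
        # Map the exact age to the interval that contains it.
        low = (record["age"] // self.age_bin_width) * self.age_bin_width
        high = low + self.age_bin_width - 1
        # Keep only a prefix of the zip code and mask the remaining characters.
        zip_code = record["zip"]
        masked = zip_code[: self.zip_prefix_len] + "*" * (
            len(zip_code) - self.zip_prefix_len
        )
        return {"age": f"{low}-{high}", "zip": masked}


# The full-resolution record stays on the client; only `coarse` would be sent.
g = Generalization(age_bin_width=10, zip_prefix_len=2)
coarse = g.apply({"age": 37, "zip": "80331"})
print(coarse)  # {'age': '30-39', 'zip': '80***'}
```

Coarser settings (wider age bins, shorter zip prefixes) reveal less about each client but typically leave less signal for the model, which is the utility versus privacy trade-off the abstract describes.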