We examine machine learning models in a setting where individuals can choose whether to share optional personal information with a decision-making system, as in modern insurance pricing models. Some users consent to the use of their data, whereas others object and keep their data undisclosed. In this work, we show that the decision not to share data can itself be considered information that should be protected to respect users' privacy. This observation raises the overlooked problem of how to ensure that users who protect their personal data do not suffer any disadvantages as a result. To address this problem, we formalize protection requirements for models that use only the information for which active user consent was obtained. This excludes the implicit information contained in the decision whether or not to share data. We offer the first solution to this problem by proposing the notion of Protected User Consent (PUC), which we prove to be loss-optimal under our protection requirement. To learn PUC-compliant models, we devise a model-agnostic data augmentation strategy with finite-sample convergence guarantees. Finally, we analyze the implications of PUC on a variety of challenging real-world datasets, tasks, and models.
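The abstract does not spell out the augmentation strategy, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes that undisclosed optional features are encoded as NaN and that, for every user who disclosed optional data, a copy of the record is added with the optional features masked while the label is kept. The intent is that a model trained on the augmented data sees "undisclosed" records drawn from the whole population rather than only from users who chose not to share. The function name `augment_with_masked_copies` and the NaN encoding are hypothetical choices made for illustration.

```python
import numpy as np


def augment_with_masked_copies(X_base, X_opt, y):
    """Hypothetical augmentation sketch (an assumption, not taken from the paper).

    X_base: (n, d_base) features every user provides.
    X_opt:  (n, d_opt) optional features, NaN where the user did not consent.
    y:      (n,) labels.

    Returns the original data plus one extra copy of each record that
    disclosed at least one optional value, with all optional values set to NaN.
    """
    X_base = np.asarray(X_base, dtype=float)
    X_opt = np.asarray(X_opt, dtype=float)
    y = np.asarray(y)

    # Rows where at least one optional feature was disclosed.
    disclosed = ~np.isnan(X_opt).all(axis=1)

    # Copies of the disclosed rows with every optional feature masked.
    masked_opt = np.full_like(X_opt[disclosed], np.nan)

    X_base_aug = np.vstack([X_base, X_base[disclosed]])
    X_opt_aug = np.vstack([X_opt, masked_opt])
    y_aug = np.concatenate([y, y[disclosed]])
    return X_base_aug, X_opt_aug, y_aug
```

Because the sketch only rewrites the training data, any learner that can handle a missing-value indicator could be trained on its output, which is one way to read "model-agnostic" here.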