We examine machine learning models in a setting where individuals can choose whether to share optional personal information with a decision-making system, as in modern insurance pricing models. Some users consent to the use of their data, whereas others object and keep their data undisclosed. In this work, we show that the decision not to share data can be regarded as information in itself that should be protected to respect users' privacy. This observation raises the overlooked problem of how to ensure that users who protect their personal data do not suffer any disadvantage as a result. To address this problem, we formalize protection requirements for models that use only the information for which active user consent was obtained, which excludes the implicit information contained in the decision to share data or not. We offer the first solution to this problem by proposing the notion of Protected User Consent (PUC), which we prove to be loss-optimal under our protection requirement. We observe that privacy and performance are not fundamentally at odds and that a decision maker can benefit from additional data while respecting users' consent. To learn PUC-compliant models, we devise a model-agnostic data augmentation strategy with finite-sample convergence guarantees. Finally, we analyze the implications of PUC on challenging real-world datasets, tasks, and models.
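The abstract does not spell out the augmentation procedure, so the following is only a minimal sketch of one plausible reading, not the authors' algorithm. It assumes a single optional feature whose undisclosed values are encoded as NaN, and it adds a masked copy of every disclosing user's record to the training data. The helper name `augment_with_masked_copies` and the toy numbers are hypothetical. The intent is that the "undisclosed" rows then reflect the whole population rather than only the users who opted out, so the act of withholding data carries no signal of its own.

```python
import numpy as np


def augment_with_masked_copies(x_base, x_opt, y):
    """Hypothetical helper (an assumption, not taken from the paper).

    x_base : always-available feature, one value per user
    x_opt  : optional feature, NaN where the user did not consent
    y      : training labels
    Returns the original data plus one masked copy of every disclosing row.
    """
    x_base = np.asarray(x_base, dtype=float)
    x_opt = np.asarray(x_opt, dtype=float)
    y = np.asarray(y, dtype=float)

    disclosed = ~np.isnan(x_opt)  # rows where the optional value was shared

    # Copies of the disclosing rows with the optional value hidden.
    x_base_aug = np.concatenate([x_base, x_base[disclosed]])
    x_opt_aug = np.concatenate([x_opt, np.full(disclosed.sum(), np.nan)])
    y_aug = np.concatenate([y, y[disclosed]])
    return x_base_aug, x_opt_aug, y_aug


# Toy data: all users share the same base feature; two disclose, two do not.
x_base = np.array([0.0, 0.0, 0.0, 0.0])
x_opt = np.array([1.0, 1.0, np.nan, np.nan])
y = np.array([10.0, 10.0, 2.0, 2.0])

xb_aug, xo_aug, y_aug = augment_with_masked_copies(x_base, x_opt, y)

# Mean label over "undisclosed" rows before and after augmentation.
hidden_before = y[np.isnan(x_opt)].mean()
hidden_after = y_aug[np.isnan(xo_aug)].mean()
population_mean = y.mean()
```

In this toy example, a model fit on the raw data would learn a target for undisclosed rows (`hidden_before`) that differs from the population average, which means that withholding the optional value changes the prediction. After augmentation, the average label over undisclosed rows (`hidden_after`) matches `population_mean`, while the disclosing rows are still present with their optional value for the model to use. Any learner that accepts missing values could be trained on the augmented arrays, which is what makes a strategy of this kind model-agnostic.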