Training data privacy has been a top concern in AI modeling. While methods such as differentially private learning allow data contributors to quantify their acceptable privacy loss, model utility is often significantly degraded. In practice, controlled data access remains a mainstream method for protecting data privacy in many industrial and research environments: authorized model builders access sensitive data within a restricted environment, which fully preserves data utility while reducing the risk of data leakage. However, unlike differential privacy, controlled data access offers no quantitative measure that lets an individual data contributor assess their privacy risk before participating in a machine learning task. We developed the demo prototype FT-PrivacyScore to show that the privacy risk of participating in a model fine-tuning task can be estimated efficiently and quantitatively. The demo source code will be available at \url{https://github.com/RhincodonE/demo_privacy_scoring}.