Removing personally identifiable information (PII) from text is necessary to comply with data protection regulations and to enable data sharing without compromising privacy. However, recent work shows that documents sanitized by PII removal techniques are vulnerable to reconstruction attacks. We suspect that the reported success of these attacks is largely overestimated. We critically analyze how existing attacks are evaluated and find that data leakage and data contamination are not properly mitigated, which leaves open the question of whether PII removal techniques truly protect privacy in real-world scenarios. We investigate possible data sources and attack setups that avoid data leakage and conclude that only truly private data allows an objective evaluation of vulnerabilities in PII removal techniques. Access to private data is heavily restricted, and for good reasons, which also means that the public research community cannot address this problem in a transparent, reproducible, and trustworthy manner.