Database de-anonymization typically involves matching an anonymized database with correlated publicly available data. Existing research focuses either on practical aspects, requiring no knowledge of the data distribution but providing limited guarantees, or on theoretical aspects, assuming known distributions. This paper aims to bridge these two approaches by offering theoretical guarantees for database de-anonymization under synchronization errors and obfuscation without prior knowledge of the data distribution. Using a modified replica detection algorithm and a new seeded deletion detection algorithm, we establish sufficient conditions on the database growth rate for successful matching and show that a seed size double-logarithmic in the row size suffices for detecting deletions in the database. Importantly, our findings indicate that these sufficient de-anonymization conditions are tight and coincide with those of the distribution-aware setting, so no asymptotic performance is lost because the distributions are unknown. Finally, we evaluate the performance of the proposed algorithms through simulations, confirming their effectiveness in more practical, non-asymptotic scenarios.