There has recently been increased scientific interest in de-anonymizing users in anonymized databases of user-level microdata by applying various matching strategies to publicly available correlated data. The existing literature emphasizes either practical aspects, where the underlying data distribution is not required but theoretical guarantees are limited or absent, or theoretical aspects, under the assumption that the underlying distributions are fully available. In this work, we take a step towards reconciling these two lines of work by providing theoretical guarantees for the de-anonymization of random correlated databases without prior knowledge of the data distribution. Motivated by time-indexed microdata, we consider database de-anonymization under both synchronization errors (column repetitions) and obfuscation (noise). By modifying the previously used replica detection algorithm to accommodate the unknown underlying distribution, proposing a new seeded deletion detection algorithm, and employing statistical and information-theoretic tools, we derive sufficient conditions on the database growth rate for successful matching. Our findings demonstrate that a seed size that is double-logarithmic in the row size ensures successful deletion detection. More importantly, we show that the derived sufficient conditions are the same as in the distribution-aware setting, so the unknown underlying distributions cause no asymptotic loss of performance.