Database Matching Under Adversarial Column Deletions

The de-anonymization of users from anonymized microdata by matching (aligning) it with publicly available correlated databases has recently attracted scientific interest. Most rigorous analyses of database matching have focused on random-distortion models, while adversarial-distortion models remain largely unexplored. In this work, motivated by synchronization errors in the sampling of time-indexed microdata, we investigate the matching (alignment) of random databases under adversarial column deletions. We assume a constrained adversary that observes the anonymized database and can delete up to a $\delta$ fraction of the columns (attributes) to hinder matching and preserve privacy. Column histograms of the two databases serve as permutation-invariant features for detecting the column deletion pattern chosen by the adversary. Once the deletion pattern is detected, an exact row (user) matching scheme is applied. A worst-case analysis of this two-phase scheme yields a sufficient condition for successful matching of the two databases under the near-perfect recovery condition. A more detailed investigation of the error probability leads to a tight necessary condition on the database growth rate and, in turn, to a single-letter characterization of the adversarial matching capacity. This capacity is shown to be significantly lower than the random matching capacity, which applies when the column deletions occur randomly. Overall, our results analytically demonstrate the privacy advantages of adversarial mechanisms over random ones in the publication of anonymized time-indexed data.
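
To make the two-phase idea concrete, the following is a minimal Python sketch under strong simplifying assumptions that are not taken from the paper: the second database is a noiseless, row-shuffled copy of the first with some columns deleted, all column histograms of the first database are distinct, and the rows remain distinct after deletion. The function names (`column_histograms`, `detect_kept_columns`, `match_rows`) and the toy data are hypothetical, and the greedy search is only one simple way to use histograms as permutation-invariant features; it should not be read as the scheme analyzed in this work.

```python
import numpy as np

def column_histograms(db, alphabet_size):
    """Return one symbol-count histogram per column (invariant to row order)."""
    hist = np.zeros((db.shape[1], alphabet_size), dtype=int)
    for j in range(db.shape[1]):
        hist[j] = np.bincount(db[:, j], minlength=alphabet_size)
    return hist

def detect_kept_columns(db1, db2, alphabet_size):
    """Phase 1: greedily align the column histograms of db2 with those of db1.

    Assumes (for illustration) that deletions preserve column order and that
    the histograms of db1's columns are distinct.
    """
    h1 = column_histograms(db1, alphabet_size)
    h2 = column_histograms(db2, alphabet_size)
    kept, j = [], 0
    for target in h2:
        # Advance through db1's columns until the histogram matches.
        while j < h1.shape[0] and not np.array_equal(h1[j], target):
            j += 1
        if j == h1.shape[0]:
            raise ValueError("no consistent deletion pattern found")
        kept.append(j)
        j += 1
    return kept

def match_rows(db1, db2, kept):
    """Phase 2: exact row matching on the columns that survived deletion."""
    index = {tuple(row.tolist()): i for i, row in enumerate(db1[:, kept])}
    if len(index) != db1.shape[0]:
        raise ValueError("rows are not distinct after deletion")
    return [index[tuple(row.tolist())] for row in db2]

# Toy example: 4 users, 4 attributes, ternary alphabet {0, 1, 2}.
db1 = np.array([[0, 1, 2, 0],
                [1, 1, 0, 0],
                [2, 0, 1, 1],
                [0, 2, 2, 0]])
perm = [2, 0, 3, 1]            # row k of db2 is row perm[k] of db1
kept_true = [0, 1, 3]          # column 2 is deleted
db2 = db1[perm][:, kept_true]

kept = detect_kept_columns(db1, db2, alphabet_size=3)
matching = match_rows(db1, db2, kept)
print(kept)      # [0, 1, 3]
print(matching)  # [2, 0, 3, 1]
```

In this toy instance the histograms identify the deleted column exactly, after which each row of the second database maps back to a unique row of the first. When histograms or rows collide, the sketch either raises an error or may pick a wrong alignment, which is why the distinctness assumptions above matter.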
