Privacy-preserving record linkage (PPRL) is an essential component of data integration tasks involving sensitive information. The linkage quality determines the usability of the combined datasets and of the (machine learning) applications built on them. We present a novel privacy-preserving protocol that integrates clerical review into PPRL using a multi-layer active learning process. Uncertain match candidates are reviewed on several layers by human and non-human oracles to reduce the amount of information disclosed per record and in total. Predictions are propagated back to update previous layers, which improves linkage performance for non-reviewed candidates as well. The data owners remain in control of the amount of information they share for each record. Our approach therefore follows the need-to-know and data sovereignty principles. The experimental evaluation on real-world datasets shows considerable improvements in linkage quality with limited labeling effort and limited privacy risk.
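The layered review described above can be illustrated with a minimal sketch. It is not the protocol from the paper: the names `Candidate`, `Oracle`, and `review_in_layers`, the two score thresholds, and the convention that an oracle returns `None` to escalate a candidate to the next layer are all assumptions made for illustration. The back-propagation of predictions to earlier layers and the actual privacy-preserving encoding are omitted.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Candidate:
    """A candidate record pair (hypothetical structure)."""
    pair_id: str
    score: float                       # similarity from the privacy-preserving comparison
    label: Optional[bool] = None       # True = match, False = non-match, None = undecided
    resolved_at: Optional[int] = None  # review layer that decided; None if decided by threshold


# An oracle returns True/False, or None when it cannot decide and escalates.
Oracle = Callable[[Candidate], Optional[bool]]


def review_in_layers(
    candidates: List[Candidate],
    oracles: List[Oracle],
    lower: float,
    upper: float,
) -> List[Candidate]:
    """Label clear cases by threshold, then escalate uncertain ones layer by layer."""
    # Clear matches and non-matches never reach an oracle, so nothing
    # beyond the similarity score is disclosed for them.
    for c in candidates:
        if c.score >= upper:
            c.label = True
        elif c.score < lower:
            c.label = False

    # Each layer sees only the candidates that all earlier layers left undecided.
    # The assumption is that later layers disclose more information per record.
    for layer, oracle in enumerate(oracles, start=1):
        for c in candidates:
            if c.label is None:
                decision = oracle(c)
                if decision is not None:
                    c.label = decision
                    c.resolved_at = layer
    return candidates
```

In this sketch the first oracle would typically be an automated check and the last a human reviewer, so that the most revealing layer handles the fewest candidates. Candidates that no oracle decides keep `label=None`.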