The entity resolution problem requires finding record pairs across datasets that belong to different owners but refer to the same real-world entity. To train and evaluate solutions to this problem, whether rule-based or machine-learning-based, a ground truth dataset of entity pairs or clusters must be generated. However, such data annotation requires humans acting as domain oracles to review the plaintext data of all candidate record pairs from different parties, which inevitably infringes on the privacy of data owners, especially in privacy-sensitive cases such as medical records. To the best of our knowledge, there is no prior work on privacy-preserving ground truth dataset generation, especially in the domain of entity resolution. We propose a novel blind annotation protocol based on homomorphic encryption that allows domain oracles to collaboratively label ground truths without sharing plaintext data with other parties. In addition, we design an easy-to-use domain-specific language that hides the sophisticated underlying homomorphic encryption layer. We provide a rigorous proof of the privacy guarantee, and our empirical experiments with an annotation simulator indicate the feasibility of the protocol: the average F-measure exceeds 90% when compared against the real ground truths.