Labeling without Seeing? Blind Annotation for Privacy-Preserving Entity Resolution

The entity resolution problem requires finding pairs of records across datasets that belong to different owners but refer to the same real-world entity. To train and evaluate solutions to this problem (either rule-based or machine-learning-based), a ground truth dataset of entity pairs or clusters is needed. However, the data annotation process requires humans, acting as domain oracles, to review the plaintext data of all candidate record pairs from different parties. This inevitably infringes on the privacy of the data owners, especially in privacy-sensitive cases such as medical records. To the best of our knowledge, there is no prior work on privacy-preserving ground truth dataset generation, especially in the domain of entity resolution. We propose a novel blind annotation protocol based on homomorphic encryption that allows domain oracles to collaboratively label ground truths without sharing data in plaintext with other parties. In addition, we design a domain-specific, easy-to-use language that hides the sophisticated underlying homomorphic encryption layer. We provide a rigorous proof of the privacy guarantee, and our empirical experiments with an annotation simulator indicate the feasibility of the protocol: on average, the f-measure against the real ground truths exceeds 90%.
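The abstract reports an f-measure of blind annotations against the real ground truths but does not spell out how it is computed. As an illustration only, the sketch below assumes a pairwise evaluation: each label is an unordered pair of record identifiers, and precision, recall, and F1 are computed over the two sets of pairs. The function name `pairwise_f_measure` and the record identifiers are hypothetical and do not come from the paper.

```python
def pairwise_f_measure(predicted_pairs, true_pairs):
    """Return (precision, recall, f1) over unordered record pairs.

    Assumption: a match is one unordered pair of record ids, so
    ("a1", "b1") and ("b1", "a1") count as the same pair.
    """
    predicted = {frozenset(p) for p in predicted_pairs}
    truth = {frozenset(p) for p in true_pairs}
    true_positives = len(predicted & truth)

    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical labels: pairs produced by blind annotation vs. the real truth.
blind_labels = [("a1", "b1"), ("a2", "b2"), ("a3", "b4")]
real_truth = [("a1", "b1"), ("a2", "b2"), ("a3", "b3")]

precision, recall, f1 = pairwise_f_measure(blind_labels, real_truth)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here two of the three blind labels agree with the real ground truth, so precision, recall, and F1 all come out to 2/3. A cluster-based evaluation would need a different definition.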


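Entity Resolution

Different data providers may describe the same thing, that is, the same entity, in different ways (descriptions here include data format, representation, and so on). Each description of an entity is called a reference to that entity. Entity resolution is the process of resolving a set of references and mapping them to real-world entities. Entity resolution is also known as record linkage, object identification, individual identification, and duplicate detection.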
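To make the idea of mapping references to entities concrete, here is a deliberately simple toy sketch. It assumes that two references denote the same entity when their names match after crude normalization. The helper names `normalize` and `resolve` and the sample records are hypothetical. The sketch works on plaintext, so it shows only the resolution step, not the privacy-preserving annotation protocol described in the abstract.

```python
from collections import defaultdict


def normalize(name):
    """Crude canonical key: lowercase, drop punctuation, sort the tokens."""
    cleaned = "".join(
        ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower()
    )
    return " ".join(sorted(cleaned.split()))


def resolve(references):
    """Group reference ids whose normalized names are identical."""
    clusters = defaultdict(list)
    for ref_id, name in references:
        clusters[normalize(name)].append(ref_id)
    return list(clusters.values())


# Hypothetical references from two data providers, A and B.
references = [
    ("A-1", "Smith, John"),
    ("B-7", "john smith"),
    ("B-9", "Jane Doe"),
]

clusters = resolve(references)
print(clusters)
```

The two differently formatted references to John Smith land in one cluster, and the Jane Doe reference stays on its own. Real entity resolution systems use far richer similarity measures than exact matching on a normalized key.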
